6 changes: 3 additions & 3 deletions bare_metal.md
@@ -8,11 +8,11 @@ We are using a few bare metal nodes for some critical services.

They have FQDNs assigned by us, collected [here](https://github.com/usegalaxy-eu/infrastructure/blob/4e22b02395c1bb8872ebda711cc123968ae8589f/dns.tf#L102)

- `sn06.galaxyproject.eu` is the Galaxy machine
+ `sn09.galaxyproject.eu` is the Galaxy machine

- `sn05.galaxyproject.eu` is the PostgreSQL and HTCcondor central manager machine
+ `sn11.galaxyproject.eu` is the PostgreSQL server

- `build.galaxyproject.eu` is the Jenkins master machine
+ `build.galaxyproject.eu` is the Jenkins master machine and the HTCondor central manager

`zfs1.galaxyproject.eu` is a ZFS server
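
A quick sanity check that the names above (managed in the linked `dns.tf`) actually resolve, using plain `dig`:

```console
$ dig +short sn09.galaxyproject.eu
$ dig +short sn11.galaxyproject.eu
$ dig +short build.galaxyproject.eu
```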

4 changes: 2 additions & 2 deletions clean_DATA_FETCH.md
@@ -1,7 +1,7 @@
Given this script:

```python3
- galaxy@sn06:/data/dnb02/galaxy_db$ cat /tmp/get_wd.py
+ galaxy@sn09:/data/dnb02/galaxy_db$ cat /tmp/get_wd.py
#!/usr/bin/python

import os, sys, shutil
@@ -41,7 +41,7 @@ cat /tmp/top_120.txt | xargs -i /tmp/get_wd.py {}

4. reassign handlers - change from celery to the real stuff
```bash
- cat /tmp/top_120.txt | xargs -i gxadmin mutate reassign-job-to-handler {} handler_sn06_0 --commit
+ cat /tmp/top_120.txt | xargs -i gxadmin mutate reassign-job-to-handler {} handler_sn09_0 --commit
```
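
To confirm the reassignment took effect, check the handler column afterwards. A sketch reusing the `gxadmin query q` shorthand used elsewhere in these notes; the job id is illustrative:
```bash
gxadmin query q "select id, handler, state from job where id = 12345678"
```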

5. restart the job
6 changes: 3 additions & 3 deletions cloud/compare-clouds-and-condor.md
@@ -1,12 +1,12 @@
# Compare VMs in 2+ clouds and HTCondor
- Simply replace the Openstack credentials with yours, maybe change the sn06 hostname, voilà.
+ Simply replace the Openstack credentials with yours, maybe change the sn09 hostname, voilà.
~~~
#!/bin/bash
date=$(date '+%Y-%m-%d')

- ssh sn06 "condor_status --compact | grep vgcnbwc-worker | cut -d '.' -f1 > /tmp/$(date '+%Y-%m-%d')-condor"
+ ssh sn09 "condor_status --compact | grep vgcnbwc-worker | cut -d '.' -f1 > /tmp/$(date '+%Y-%m-%d')-condor"

- scp sn06:/tmp/$date-condor /tmp/$date-condor
+ scp sn09:/tmp/$date-condor /tmp/$date-condor

# replace with yours
source ~/app-cred-Mira-openrc.sh
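
# The rest of the script is truncated in this diff view. What follows is a
# hedged sketch of the comparison step it presumably performs: list the
# cloud VMs and diff them against the HTCondor worker list collected above.
# ($date as defined above; -f value -c Name are standard openstack-client
# output flags.)
openstack server list -f value -c Name | grep vgcnbwc-worker | sort > /tmp/$date-openstack
sort /tmp/$date-condor > /tmp/$date-condor-sorted

# Workers HTCondor knows about but the cloud does not (swap to -13 for the reverse):
comm -23 /tmp/$date-condor-sorted /tmp/$date-openstack
~~~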
2 changes: 1 addition & 1 deletion cloud/services.md
@@ -5,7 +5,7 @@ title: Galaxy Europe Services
# TIaaS

- [Admin web interface](https://usegalaxy.eu/tiaas/admin/login/?next=/tiaas/admin/)
- - service lives on [sn06](https://github.com/usegalaxy-eu/infrastructure-playbook/blob/4e6121da8af500dfe878c312243be49807ac5f48/sn06.yml#L152)
+ - service lives on [sn09](https://github.com/usegalaxy-eu/infrastructure-playbook/blob/be8d196b26f46852bc593a0d8a64e66dedde69c5/sn09.yml#L369)
- Deployed with the [usegalaxy_eu.tiaas2](https://github.com/galaxyproject/ansible-tiaas2) Ansible role using this [vars](https://github.com/usegalaxy-eu/infrastructure-playbook/blob/master/group_vars/tiaas.yml)
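
A quick liveness check against the admin endpoint linked above (expects an HTTP 200 or a redirect to the login page):

```console
$ curl -sI 'https://usegalaxy.eu/tiaas/admin/login/?next=/tiaas/admin/' | head -n1
```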

# Grafana
4 changes: 2 additions & 2 deletions head_maintenance_nodes.md
@@ -1,8 +1,8 @@
## Ansible roles in the [infrastructure-playbook](https://github.com/usegalaxy-eu/infrastructure-playbook) repository

- * The following are the roles that are currently being installed on the head and maintenance nodes via the [sn06 playbook](https://github.com/usegalaxy-eu/infrastructure-playbook/blob/master/sn06.yml), [sn07 playbook](https://github.com/usegalaxy-eu/infrastructure-playbook/blob/master/sn07.yml), and [maintenance node playbook](https://github.com/usegalaxy-eu/infrastructure-playbook/blob/master/maintenance.yml)
+ * The following are the roles that are currently being installed on the head and maintenance nodes via the [sn09 playbook](https://github.com/usegalaxy-eu/infrastructure-playbook/blob/master/sn09.yml), [sn07 playbook](https://github.com/usegalaxy-eu/infrastructure-playbook/blob/master/sn07.yml), and [maintenance node playbook](https://github.com/usegalaxy-eu/infrastructure-playbook/blob/master/maintenance.yml)
* The roles are classified as either head node only, maintenance node only, or both
- * Head nodes: are the nodes that are running the Galaxy web server, the Galaxy job handlers, and the Galaxy workflow schedulers. As of 15/02/2023 `sn06.galaxyproject.eu`, and `sn07.galaxyproject.eu` are the two head nodes. Only `sn06` is in production.
+ * Head nodes: are the nodes that run the Galaxy web server, the Galaxy job handlers, and the Galaxy workflow schedulers. As of 27/07/2025, `sn09.galaxyproject.eu` is the only head node and the only one in production; `sn10` will be added later.
* Maintenance node: runs cron jobs, contains the Galaxy codebase and config, pushes data to InfluxDB, performs cleanup tasks, syncs the Galaxy codebase to NFS, etc.
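
To inspect what actually landed on a maintenance node, listing the cron entries the playbook installed is usually the fastest check (the `galaxy` user is an assumption; some jobs may be installed for other users):

```console
$ sudo crontab -l -u galaxy
```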


10 changes: 6 additions & 4 deletions infrastructure_playbook_repo.md
@@ -29,7 +29,7 @@ The playbooks are:

* [apollo.yml](https://github.com/usegalaxy-eu/infrastructure-playbook/blob/master/apollo.yml): [Apollo](https://genomearchitect.readthedocs.io/en/latest/) is a web-based genome annotation editor. It can be accessed through Galaxy to view, edit, and annotate genomes. It uses Tomcat and therefore runs on a separate server. Additional information can be found [here](https://github.com/usegalaxy-eu/operations/blob/f8062472110116a4ddf3035c3d43374443ec235e/cloud/services.md#apollo)
* [beacon.yml](https://github.com/usegalaxy-eu/infrastructure-playbook/blob/master/beacon.yml): [Beacon](https://beacon-project.io/) is a service that allows querying for the presence of specific variants in a given dataset. We provide this service as part of our Galaxy EU instance. It runs on a [VM](https://github.com/usegalaxy-eu/infrastructure/blob/main/instance_dedicated_beacon.tf) in the cloud.
- * [build.yml](https://github.com/usegalaxy-eu/infrastructure-playbook/blob/master/build.yml): This playbook is used to setup our [Jenkins server](https://build.galaxyproject.eu/).
+ * [build.yml](https://github.com/usegalaxy-eu/infrastructure-playbook/blob/master/build.yml): This playbook is used to setup our [Jenkins server](https://build.galaxyproject.eu/). This is also our HTCondor central manager server. (This server and its playbook will be deprecated in favour of the new `sn12` server; the `sn12` playbook will be added later.)
* [celery.yml](https://github.com/usegalaxy-eu/infrastructure-playbook/blob/master/celery.yml): This playbook is used to setup the [Celery](https://docs.celeryq.dev/en/stable/) node(s) that are used by Galaxy for running various jobs. For more information refer to this [doc](https://github.com/usegalaxy-eu/operations/blob/main/celery.md) and our [training material](https://training.galaxyproject.org/training-material/topics/admin/tutorials/celery/slides.html#1)
* [cvmfs.yml](https://github.com/usegalaxy-eu/infrastructure-playbook/blob/master/cvmfs.yml): This playbook is used to setup the [CernVM-FS](https://cvmfs.readthedocs.io/en/stable/cpt-repo.html) server that is used by Galaxy to serve the reference data. Refer to these training materials for more information: [Reference Data with CVMFS](https://training.galaxyproject.org/training-material/topics/admin/tutorials/cvmfs/tutorial.html), and [Reference Data with CVMFS without Ansible](https://training.galaxyproject.org/training-material/topics/admin/tutorials/cvmfs-manual/tutorial.html)
* [galaxy-test.yml](https://github.com/usegalaxy-eu/infrastructure-playbook/blob/master/galaxy-test.yml): This playbook is used to setup the Galaxy test instance. This is used to perform tests on the Galaxy codebase before deploying it to the main Galaxy instance.
@@ -38,10 +38,12 @@ The playbooks are:
* [influxdb.yml](https://github.com/usegalaxy-eu/infrastructure-playbook/blob/master/influxdb.yml): This playbook is used to setup the [InfluxDB](https://www.influxdata.com/) server that is used to store the metrics that are collected by Telegraf. Refer to [this](https://github.com/usegalaxy-eu/operations/blob/main/influxdb.md) and [this](https://github.com/usegalaxy-eu/operations/blob/main/cloud/services.md#influxdb) doc. Here is the [training material](https://training.galaxyproject.org/training-material/topics/admin/tutorials/monitoring/tutorial.html#influxdb)
* [mq.yml](https://github.com/usegalaxy-eu/infrastructure-playbook/blob/master/mq.yml): This playbook is used to setup the [RabbitMQ](https://www.rabbitmq.com/) server that is used by Galaxy. Refer [here](https://github.com/usegalaxy-eu/operations/blob/f8062472110116a4ddf3035c3d43374443ec235e/cloud/services.md#rabbitmq) and [here](https://github.com/usegalaxy-eu/operations/blob/main/celery.md#chapter-one-down-the-rabbit-hole) for details.
* [plausible.yml](https://github.com/usegalaxy-eu/infrastructure-playbook/blob/master/plausible.yml): This playbook is used to setup the [Plausible](https://plausible.io/) server that is used to collect the [analytics](https://stats.galaxyproject.eu/) for our Galaxy instance.
- * [sn05.yml](https://github.com/usegalaxy-eu/infrastructure-playbook/blob/master/sn05.yml): This playbook is used to setup the Galaxy PostgreSQL database server and also the [HTCondor](https://htcondor.org/) cluster manager.
- * [sn06.yml](https://github.com/usegalaxy-eu/infrastructure-playbook/blob/master/sn06.yml): This playbook configures the [Galaxy server](https://usegalaxy.eu). This is the main Galaxy server that is used by the users. This we denote as `headnode 1`. Refer to this [training material](https://training.galaxyproject.org/training-material/topics/admin/tutorials/ansible-galaxy/tutorial.html) to set up Galaxy.
+ * [sn05.yml](https://github.com/usegalaxy-eu/infrastructure-playbook/blob/master/sn05.yml): This playbook is used to setup the Galaxy PostgreSQL database server. (deprecated, use [sn11.yml](https://github.com/usegalaxy-eu/infrastructure-playbook/blob/master/sn11.yml) instead)
+ * [sn06.yml](https://github.com/usegalaxy-eu/infrastructure-playbook/blob/master/sn06.yml): This playbook configures the [Galaxy server](https://usegalaxy.eu). This is the main Galaxy server that is used by the users. This we denote as `headnode 1`. Refer to this [training material](https://training.galaxyproject.org/training-material/topics/admin/tutorials/ansible-galaxy/tutorial.html) to set up Galaxy. (deprecated, use [sn09.yml](https://github.com/usegalaxy-eu/infrastructure-playbook/blob/master/sn09.yml) instead)
* [sn07.yml](https://github.com/usegalaxy-eu/infrastructure-playbook/blob/master/sn07.yml): This playbook also configures the `galaxy server` but this is not in production (for now 22/02/2023). This we denote as `headnode 2`. Refer to this [training material](https://training.galaxyproject.org/training-material/topics/admin/tutorials/ansible-galaxy/tutorial.html) to set up Galaxy.
- * [syn-to-nfs.yml](https://github.com/usegalaxy-eu/infrastructure-playbook/blob/master/sync-to-nfs.yml): This playbook is used to sync the data of the Galaxy codebase on `headnode 1 (sn06)` to a NFS server. This is then synced to all nodes that needs the up-to-date Galaxy codebase and configuration files.
+ * [sn09.yml](https://github.com/usegalaxy-eu/infrastructure-playbook/blob/master/sn09.yml): This playbook configures the [Galaxy server](https://usegalaxy.eu). This is the main Galaxy server that is used by the users. We denote it as `headnode 1`. Refer to this [training material](https://training.galaxyproject.org/training-material/topics/admin/tutorials/ansible-galaxy/tutorial.html) to set up Galaxy.
+ * [sn11.yml](https://github.com/usegalaxy-eu/infrastructure-playbook/blob/master/sn11.yml): This playbook is used to setup the Galaxy PostgreSQL database server.
+ * [sync-to-nfs.yml](https://github.com/usegalaxy-eu/infrastructure-playbook/blob/master/sync-to-nfs.yml): This playbook is used to sync the Galaxy codebase on `headnode 1 (sn09)` to an NFS server. From there it is synced to all nodes that need the up-to-date Galaxy codebase and configuration files.
* [telescope.yml](https://github.com/usegalaxy-eu/infrastructure-playbook/blob/master/telescope.yml): This playbook is used to setup the [Galactic Radio Telescope](https://github.com/hexylena/galactic-radio-telescope)
* [upload.yml](https://github.com/usegalaxy-eu/infrastructure-playbook/blob/master/upload.yml): This playbook sets up the [TUS](https://tus.io/) server that is used to upload data to Galaxy. Refer to this [training material](https://training.galaxyproject.org/training-material/topics/admin/tutorials/tus/tutorial.html) to set up TUS.
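
To apply any of these playbooks from a checkout of the repository, the usual Ansible pattern applies. This is a sketch only; the inventory and vault-password paths are assumptions, so check the repository's README/Makefile for the canonical invocation:

```bash
ansible-playbook -i hosts sn09.yml --vault-password-file ~/.vault-pass.txt --diff --check
```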

2 changes: 1 addition & 1 deletion jobs.md
@@ -261,7 +261,7 @@ condor_q -autoformat ClusterId Cmd JobDescription RemoteHost JobStartDate | awk
Helped to solve
- https://github.com/usegalaxy-eu/issues/issues/504
~~~
- gxadmin query q "select job.id from job inner join job_state_history jh on job.id = jh.job_id where job.handler = 'handler_sn06_0' and job.tool_id != '__DATA_FETCH__' and ( job.update_time between timestamp '2023-12-14 11:00:00' and '2023-12-14 12:00:00' )" | awk '{print$1}' | sort | uniq -c | sort -sn
+ gxadmin query q "select job.id from job inner join job_state_history jh on job.id = jh.job_id where job.handler = 'handler_sn09_0' and job.tool_id != '__DATA_FETCH__' and ( job.update_time between timestamp '2023-12-14 11:00:00' and '2023-12-14 12:00:00' )" | awk '{print$1}' | sort | uniq -c | sort -sn
~~~
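
For a single suspicious job id from that output, the full state history can then be pulled the same way (the job id is illustrative):
~~~
gxadmin query q "select create_time, state from job_state_history where job_id = 12345678 order by create_time"
~~~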

### Show all jobs from PXE test nodes
24 changes: 12 additions & 12 deletions notes.md
@@ -4,15 +4,15 @@ https://github.com/galaxyproject/galaxy-hub/blob/master/content/bare/eu/usegalax

# decode Galaxy id
```
- user@sn06:~$ . /opt/galaxy/venv/bin/activate
- (venv) user@sn06:~$ cd /opt/galaxy
- (venv) user@sn06:/opt/galaxy$ python server/scripts/secret_decoder_ring.py decode ec81bbe85ee13506
+ user@sn09:~$ . /opt/galaxy/venv/bin/activate
+ (venv) user@sn09:~$ cd /opt/galaxy
+ (venv) user@sn09:/opt/galaxy$ python server/scripts/secret_decoder_ring.py decode ec81bbe85ee13506
746380
```
or using gxadmin
```
- user@sn06:~$ . /opt/galaxy/venv/bin/activate
- (venv) user@sn06:~$ GALAXY_ROOT=/opt/galaxy/server GALAXY_CONFIG_FILE=/opt/galaxy/config/galaxy.yml gxadmin galaxy decode ec81bbe85ee13506
+ user@sn09:~$ . /opt/galaxy/venv/bin/activate
+ (venv) user@sn09:~$ GALAXY_ROOT=/opt/galaxy/server GALAXY_CONFIG_FILE=/opt/galaxy/config/galaxy.yml gxadmin galaxy decode ec81bbe85ee13506
746380
```
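
The same script also works in the other direction, handy for turning a database id back into the hashed id shown in the UI (this mirrors the decode call above):
```
(venv) user@sn09:/opt/galaxy$ python server/scripts/secret_decoder_ring.py encode 746380
ec81bbe85ee13506
```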

@@ -43,9 +43,9 @@ print(tf.test.is_built_with_cuda()); print( tf.test.is_gpu_available())
Check the utilization of the GPU on the host system:

```console
> nvidia-smi

Sun Aug 25 22:41:01 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67 Driver Version: 418.67 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
@@ -64,7 +64,7 @@ Sun Aug 25 22:41:01 2019
| 3 Tesla T4 Off | 00000000:00:08.0 Off | 0 |
| N/A 43C P0 26W / 70W | 0MiB / 15079MiB | 0% Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
@@ -103,7 +103,7 @@ root@build:~$ grep -A1 vault-pass-usegalaxy-star /opt/jenkins/jenkins/jobs/usega
go to jenkins → manage jenkins → script console
https://build.galaxyproject.eu/script

google "jenkins decrypt secret" because you can never remember
println(hudson.util.Secret.fromString("{supersecretstringhere}").getPlainText())

4. done!
@@ -113,7 +113,7 @@ println(hudson.util.Secret.fromString("{supersecretstringhere}").getPlainText())
Find jenkins.war; in our case it's at `/usr/share/java`. Rename the `jenkins.war` file:

```bash
/usr/share/java$ mv jenkins.war jenkins.war_2.375
```

Get an older Jenkins version and restart.
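
A sketch of that step; the LTS version, download URL pattern, and service name are assumptions, adjust them to what you actually need:

```bash
cd /usr/share/java
wget https://get.jenkins.io/war-stable/2.375.1/jenkins.war
systemctl restart jenkins
```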
@@ -141,14 +141,14 @@ https://gist.github.com/gmauro/cc97ff1287282469ce98c2b8035100f2

# debug 'D' state in processes

Get all processes in D state:

> ps axl | awk '$10 ~ /D/'
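
To see where each of those processes is stuck in the kernel, the `wchan` field is usually enough (a generic sketch, not specific to any incident; `$3` is the PID column in `ps axl` output):

```
for p in $(ps axl | awk '$10 ~ /D/ {print $3}'); do
    echo "== pid $p ($(cat /proc/$p/comm)) =="
    cat /proc/$p/wchan; echo
done
```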

Looking at the file handles of a thread yields:

```
- root@sn06:~$ ll /proc/215503/task/296960/fd/**
+ root@sn09:~$ ll /proc/215503/task/296960/fd/**
lr-x------ 1 galaxy galaxy 64 Aug 2 17:09 /proc/215503/task/296960/fd/0 -> /dev/null
lrwx------ 1 galaxy galaxy 64 Aug 2 17:09 /proc/215503/task/296960/fd/1 -> 'socket:[3487300049]'
lr-x------ 1 galaxy galaxy 64 Aug 2 17:09 /proc/215503/task/296960/fd/10 -> /data/jwd01/main/048/946/48946081
6 changes: 3 additions & 3 deletions roles-without-repos.md
@@ -2,7 +2,7 @@
## that we carry around to fix or monitor things and that do not have their own repo/readme.md
### usegalaxy_eu.fs_maintenance
#### cron job cleanup-scripts submitted to HTCondor (this one has a repo but no description)
- This is scheduled as condor job, so I commented it for now, because we can also schedule this from sn06
+ This is scheduled as a condor job, so I commented it out for now, because we can also schedule this from sn09
and as soon as we have HTCondor on sn07 running we could uncomment it, because it will check the condor queue for running jobs before it reschedules them.
The other two cron jobs are a `docker purge` and `gxadmin cleanup`; we most likely don't need Docker anymore (also commented out), and `gxadmin cleanup` can run on only one node, because it would lead to conflicts otherwise.
### usegalaxy-eu.monitoring
@@ -23,10 +23,10 @@ This should only run on one node to avoid strange behaviour. It could be migrate
### usegalaxy-eu.unscheduled jobs/workflows
This was a fix for a Galaxy bug that should be fixed upstream by now. However, if we ever need it again, it has to run on **both headnodes**, because it greps the handler logs for 'failure running job'
### usegalaxy-eu.fix-ancient-ftp-data
This creates email-named folders to store FTP data and cleans up afterwards. I don't really know if we need this in the future.
**Clear is that this should not run on more than one node.**
### usegalaxy-eu.galaxy-procstat
Gathers information about the `galaxy-xxxx@*.services` so it can run on **both headnodes**.
Gunicorn will replace zergling in a later commit.
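
systemctl accepts glob patterns, so checking which units the role will pick up on a given headnode is a one-liner:

```console
$ systemctl list-units 'galaxy-*' --type=service
```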
### usegalaxy-eu.fix-missing-api-keys
Creates API keys for all users automatically. This was once needed for the deprecated InteractiveEnvironments (now InteractiveTools). We can **remove this from both headnodes**