@leafty leafty commented Aug 21, 2025

Closes #983.

Feature branch

Add support for remote sessions on HPC clusters.

  • Add the location field with values local (default, current session behavior) and remote (starts sessions on remote compute resources)
  • Add the remote-session-controller to the sidecar containers
    • The remote-session-controller can start HPC sessions using the FirecREST API
    • This feature can be expanded in the future to support more external compute resources (e.g. cloud providers)
  • Add the wstunnel to the sidecar containers
    • The wstunnel allows the remote session to connect to the Amalthea session resources and allows traffic from the user to be routed to the remote session frontend via the session ingress

More details have been added to new.README.md.

Contents:

leafty added a commit to SwissDataScienceCenter/renku-data-services that referenced this pull request Aug 22, 2025
leafty added a commit to SwissDataScienceCenter/renku that referenced this pull request Aug 22, 2025
@leafty leafty force-pushed the build/support-remote-sessions-hpc branch from 3701ad5 to a041bdc Compare September 1, 2025 11:26
leafty added a commit that referenced this pull request Sep 2, 2025
This changes the k8s resource name for sessions to be `HpcAmaltheaSession`. It is done to allow for experimenting with the session CRD without impacting parallel work on sessions.

This commit should be removed or reverted before the feature PR #984 is merged.
leafty added a commit to SwissDataScienceCenter/renku that referenced this pull request Sep 3, 2025
leafty added a commit to SwissDataScienceCenter/renku-data-services that referenced this pull request Sep 3, 2025
leafty added a commit to SwissDataScienceCenter/renku-data-services that referenced this pull request Sep 5, 2025
leafty added a commit to SwissDataScienceCenter/renku-data-services that referenced this pull request Sep 16, 2025
leafty added a commit to SwissDataScienceCenter/renku-data-services that referenced this pull request Sep 18, 2025
leafty added a commit to SwissDataScienceCenter/renku-data-services that referenced this pull request Sep 29, 2025
@leafty leafty force-pushed the build/support-remote-sessions-hpc branch 2 times, most recently from 3571de7 to b839361 Compare September 29, 2025 12:40
leafty added a commit that referenced this pull request Sep 30, 2025
@leafty leafty force-pushed the build/support-remote-sessions-hpc branch from 4070212 to 2eea81e Compare September 30, 2025 07:43
leafty added a commit that referenced this pull request Sep 30, 2025
@leafty leafty force-pushed the build/support-remote-sessions-hpc branch from 2eea81e to a87a3cb Compare September 30, 2025 14:23
leafty added a commit to SwissDataScienceCenter/renku that referenced this pull request Oct 6, 2025
leafty added a commit to SwissDataScienceCenter/renku that referenced this pull request Oct 6, 2025
leafty and others added 4 commits October 8, 2025 08:24
This changes the k8s resource name for sessions to be `HpcAmaltheaSession`. It is done to allow for experimenting with the session CRD without impacting parallel work on sessions.

This commit should be removed or reverted before the feature PR #984 is merged.
This change adds a new `location` field on the Amalthea session CRD which has two accepted values:
* `local`: the interactive session process runs inside the session pod
* `remote`: the interactive session process runs remotely and is controlled from the session pod
    Remote sessions are first implemented to support running sessions in HPC environments, though this can be generalized to many environment types.

Only the `location` field is added; this commit contains no further changes.
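
To make the two accepted values concrete, here is a minimal sketch of how they could be validated in Go. The type, constant and function names are hypothetical and are not taken from the Amalthea code; only the string values `local` and `remote`, and `local` being the default, come from the description above.

```go
package main

import (
	"fmt"
)

// SessionLocation mirrors the two accepted values of the `location` field.
// The Go names are hypothetical; only the string values come from the PR text.
type SessionLocation string

const (
	LocationLocal  SessionLocation = "local"
	LocationRemote SessionLocation = "remote"
)

// parseLocation treats an empty value as "local" (the default)
// and rejects anything that is not one of the two accepted values.
func parseLocation(raw string) (SessionLocation, error) {
	switch SessionLocation(raw) {
	case "", LocationLocal:
		return LocationLocal, nil
	case LocationRemote:
		return LocationRemote, nil
	default:
		return "", fmt.Errorf("invalid location %q: must be %q or %q", raw, LocationLocal, LocationRemote)
	}
}

func main() {
	for _, raw := range []string{"", "local", "remote", "cloud"} {
		loc, err := parseLocation(raw)
		fmt.Printf("%q -> %q (err: %v)\n", raw, loc, err)
	}
}
```

In a CRD this kind of check is usually expressed declaratively in the schema (for example as an enum constraint) rather than in code, so the function above is only an illustration of the intended semantics.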
* experimental: remote sessions

* update

* fix types

* error formatting

* remove 'not implemented error'

* exp: use a dev name for sessions

* update

* fix e2e?

* revert non-important changes

* rerun some make targets

* feat: support remote sessions on HPC clusters

Closes #983.

_Feature branch_

* exp: use a dev name for sessions

* more updates

* feat: support remote sessions on HPC clusters

Closes #983.

_Feature branch_

* feat: install wstunnel in sidecars

* feat: add tunnel using wstunnel and os.exec.Command

* feat: add tunnel command to sidecars

* feat: basic testing of tunnel in sidecars

* refactor: use TARGETOS and TARGETARCH instead of WSTUNNEL_PLATFORM

* update

* fix e2e?

* feat: support remote sessions on HPC clusters

Closes #983.

_Feature branch_

* feat: support remote sessions on HPC clusters

Closes #983.

_Feature branch_

* Revert chartpress e2e leftovers

---------

Co-authored-by: Flora Thiebaut <flora.thiebaut@sdsc.ethz.ch>
leafty and others added 6 commits October 8, 2025 08:24
Add the remote session controller sidecar command and start it in the amalthea session.
Run the tunnel container in remote sessions and set up the HPC job to connect to it. This allows remote HPC sessions to start and makes their frontend accessible.

Co-authored-by: Salim Kayal <salim.kayal@idiap.ch>
* fix: ensure NVIDIA_VISIBLE_DEVICES is set to void for enroot on eiger

* squashme: cosmetic space

Co-authored-by: Samuel Gaist <samuel.gaist@idiap.ch>

---------

Co-authored-by: Samuel Gaist <samuel.gaist@idiap.ch>
Handles git repositories for remote sessions.

1. The git repositories are collected in the remote session controller from the `RENKU_WORKING_DIR` folder
2. The git repositories are configured in the remote session job

---------

Co-authored-by: Salim Kayal <salim.kayal@idiap.ch>
…controller (#1005)

Improvements on the remote session controller and remote session:
* Improve handling of user-defined environment variables -> use the prefix `USER_ENV_` so that the remote session controller can handle them in a robust way
* Handle the system name and partition parameters
* Use the `RSC_` prefix to configure the remote session controller from env vars
* Use the project slug to determine the session path at the HPC cluster, e.g. `$SCRATCH/renku/sessions/flora.thiebaut/demo-hpc/flora-thieba-37d8d415cbe9` -> this makes it easier for users to find their files offline (using `ssh` on the HPC cluster)
* Write the session script before starting it -> allows users to understand how HPC sessions work
* Temporary fix: revert to `wstunnel v10.1.10` on aarch64 nodes (issue with memory allocator)
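
The prefix convention and the session path layout from the list above can be sketched as follows. The function names are hypothetical, stripping the prefix from the variable name is an assumption, and the variable names in `main` are invented for illustration; the `USER_ENV_` and `RSC_` prefixes and the `renku/sessions/<project slug>/<session name>` layout come from the list.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

const (
	userEnvPrefix = "USER_ENV_" // marks user-defined variables
	rscPrefix     = "RSC_"      // marks remote session controller settings
)

// splitEnv sorts "KEY=VALUE" entries into user variables and controller
// settings. Stripping the prefix from the key is an assumption of this sketch.
// Entries without "=" or without a known prefix are ignored.
func splitEnv(environ []string) (user, controller map[string]string) {
	user = map[string]string{}
	controller = map[string]string{}
	for _, kv := range environ {
		key, value, ok := strings.Cut(kv, "=")
		if !ok {
			continue
		}
		switch {
		case strings.HasPrefix(key, userEnvPrefix):
			user[strings.TrimPrefix(key, userEnvPrefix)] = value
		case strings.HasPrefix(key, rscPrefix):
			controller[strings.TrimPrefix(key, rscPrefix)] = value
		}
	}
	return user, controller
}

// sessionPath builds <scratch>/renku/sessions/<project slug>/<session name>.
func sessionPath(scratch, projectSlug, sessionName string) string {
	return path.Join(scratch, "renku", "sessions", projectSlug, sessionName)
}

func main() {
	// The variable names below are made up for this example.
	user, controller := splitEnv([]string{
		"USER_ENV_MY_SETTING=hello",
		"RSC_EXAMPLE_OPTION=1",
		"PATH=/usr/bin",
	})
	fmt.Println(user, controller)
	fmt.Println(sessionPath("/scratch", "flora.thiebaut/demo-hpc", "flora-thieba-37d8d415cbe9"))
}
```

A dedicated prefix lets the controller pick out user variables without guessing which of its own environment variables were set by the platform.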

---------

Co-authored-by: Salim Kayal <salim.kayal@idiap.ch>
@leafty leafty force-pushed the build/support-remote-sessions-hpc branch from ee826b4 to 2bb97fd Compare October 8, 2025 06:24
Mount scratch, project and home directories based on the system response from the FirecREST API.

Also, attempt to save and rescue the session if the container is killed and restarted. Session recovery may not succeed, because killing the remote session controller may cancel the remote job, but this should help recover the session if the remote session controller runs out of memory.

---------

Co-authored-by: Salim Kayal <salim.kayal@idiap.ch>
@leafty leafty force-pushed the build/support-remote-sessions-hpc branch from 2bb97fd to 558c331 Compare October 8, 2025 06:29
Undo all changes related to using "HpcAmaltheaSession" for development.
@leafty leafty marked this pull request as ready for review October 8, 2025 06:34
@leafty leafty requested review from a team and olevski as code owners October 8, 2025 06:34
leafty added a commit to SwissDataScienceCenter/renku-data-services that referenced this pull request Oct 8, 2025
leafty added a commit to SwissDataScienceCenter/renku that referenced this pull request Oct 8, 2025
@leafty leafty enabled auto-merge (squash) October 8, 2025 07:57
@leafty leafty disabled auto-merge October 8, 2025 08:03
@leafty leafty enabled auto-merge (squash) October 8, 2025 08:03
@olevski olevski disabled auto-merge October 8, 2025 09:17
@olevski olevski merged commit 782eba4 into main Oct 8, 2025
52 of 57 checks passed
@olevski olevski deleted the build/support-remote-sessions-hpc branch October 8, 2025 09:31
leafty added a commit to SwissDataScienceCenter/renku-data-services that referenced this pull request Oct 8, 2025
Add support for remote sessions on HPC clusters. See SwissDataScienceCenter/amalthea#984 for related changes in Amalthea.

See also: SwissDataScienceCenter/amalthea#984

* Add a new `remote` field on Resource Pools which is not set for local resource pools (existing behavior) and can be set to contain the configuration to start remote sessions. When the `remote` field is set, the resource pool will start Amalthea sessions with the `location` field set to `remote`.
* Handle setting configuration and other bits to support launching remote sessions.
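
The rule in the first bullet amounts to a small mapping from resource pool to session location. The sketch below is written in Go for illustration only and does not reflect the actual renku-data-services code; the struct and field names are hypothetical stand-ins.

```go
package main

import "fmt"

// RemoteConfig and ResourcePool are hypothetical stand-ins for the real models.
type RemoteConfig struct {
	Kind string // hypothetical field describing the remote backend
}

type ResourcePool struct {
	Name   string
	Remote *RemoteConfig // nil for local resource pools (existing behavior)
}

// sessionLocation returns the value of the session `location` field
// for sessions started from the given resource pool.
func sessionLocation(pool ResourcePool) string {
	if pool.Remote != nil {
		return "remote"
	}
	return "local"
}

func main() {
	local := ResourcePool{Name: "default"}
	hpc := ResourcePool{Name: "hpc", Remote: &RemoteConfig{Kind: "example"}}
	fmt.Println(local.Name, sessionLocation(local))
	fmt.Println(hpc.Name, sessionLocation(hpc))
}
```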
leafty added a commit to SwissDataScienceCenter/renku that referenced this pull request Oct 8, 2025
leafty added a commit to SwissDataScienceCenter/renku that referenced this pull request Oct 8, 2025
leafty added a commit to SwissDataScienceCenter/renku that referenced this pull request Oct 8, 2025
Development

Successfully merging this pull request may close these issues.

[ShapeUp] Support remote sessions on HPC clusters

4 participants