feat: idle-shutdown of workspaces #88

Open

aws-prayags wants to merge 4 commits into jupyter-infra:main from aws-prayags:feat/idle-shutdown1

+1,036 −140

Contributor

aws-prayags commented Oct 27, 2025

Add Idle Shutdown Support for Workspace and WorkspaceTemplates

Templates can define default idle shutdown settings that workspaces inherit or override within configurable bounds.
Workspaces automatically stop after the configured idle period.
Templates can enforce policies by disabling overrides or setting timeout limits.
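
As an illustration of the inheritance and override rules described above, here is a minimal sketch of how an effective timeout could be resolved. The type `TemplateIdlePolicy`, its field names, and `ResolveTimeout` are hypothetical and are not taken from this PR; the actual API types may differ.

```go
package main

import (
	"errors"
	"fmt"
)

// TemplateIdlePolicy is a hypothetical stand-in for the idle shutdown
// settings a template might carry. Field names are assumptions.
type TemplateIdlePolicy struct {
	DefaultTimeoutMinutes int  // value inherited when the workspace sets nothing
	AllowOverride         bool // false means the template enforces its default
	MinTimeoutMinutes     int  // lower bound for workspace overrides
	MaxTimeoutMinutes     int  // upper bound for workspace overrides
}

// ErrOverrideNotAllowed is returned when a workspace tries to override
// a template that has disabled overrides.
var ErrOverrideNotAllowed = errors.New("template does not allow idle shutdown overrides")

// ResolveTimeout returns the effective timeout in minutes.
// A nil override means the workspace inherits the template default.
func ResolveTimeout(p TemplateIdlePolicy, override *int) (int, error) {
	if override == nil {
		return p.DefaultTimeoutMinutes, nil
	}
	if !p.AllowOverride {
		return 0, ErrOverrideNotAllowed
	}
	if *override < p.MinTimeoutMinutes || *override > p.MaxTimeoutMinutes {
		return 0, fmt.Errorf("timeout %d is outside the allowed range [%d, %d]",
			*override, p.MinTimeoutMinutes, p.MaxTimeoutMinutes)
	}
	return *override, nil
}

func main() {
	policy := TemplateIdlePolicy{
		DefaultTimeoutMinutes: 30,
		AllowOverride:         true,
		MinTimeoutMinutes:     10,
		MaxTimeoutMinutes:     120,
	}
	requested := 60
	timeout, err := ResolveTimeout(policy, &requested)
	fmt.Println(timeout, err)
}
```

Rejecting an out-of-range override, rather than silently clamping it, is one possible design choice; it keeps the failure visible to the workspace owner.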

Future items for idle-shutdown - #84

Other changes

Fixed validation loop where template failures retried indefinitely instead of stopping reconciliation
(Temp - to check with team) Removed XValidation rule that failed on legitimate finalizer updates

Testing

Tested template inheritance, workspace overrides, and policy enforcement
Verified idle detection works for both JupyterLab and CodeEditor applications in sagemaker-distribution images
(Tests yet to be added, will be added in this PR)

aws-prayags force-pushed the feat/idle-shutdown1 branch 5 times, most recently from 14d665b to a875077

October 27, 2025 18:40

aws-prayags marked this pull request as ready for review

October 27, 2025 18:47


          feat: add support for idle-shutdown of workspaces

1b63898

aws-prayags force-pushed the feat/idle-shutdown1 branch from a875077 to 1b63898

October 27, 2025 18:55

JGuinegagne changed the title ~~feat: add support for idle-shutdown of workspaces~~ feat: idle-shutdown of workspaces

JGuinegagne reviewed

Contributor

JGuinegagne left a comment

Few high-level comments:

pipe through the command as exec attribute of the idleShutdownConfig (following Probe precedent)
look at probe interface, and evaluate if we should reuse the interface
if no reuse, then evaluate which attributes are relevant

api/v1alpha1/workspace_types.go Outdated

    
              	// EndpointCheck specifies HTTP endpoint to check for idle status

              	// +optional

              	EndpointCheck *EndpointCheckSpec `json:"endpointCheck,omitempty"`

              }

Contributor

JGuinegagne Oct 27, 2025

non-blocking: should there be a HealthCheckIntervalMinutes attribute or similar?

Consider following the precedent of ReadinessProbe:

kubectl explain pod.spec.containers.readinessProbe

Contributor Author

aws-prayags Oct 27, 2025

I did consider this. My preference is to start with a reasonable default (5 minutes in this PR) and add support for HealthCheckIntervalMinutes in the future if needed.

api/v1alpha1/workspace_types.go Outdated

config/samples/idle-shutdown/templates/02-jupyter-template.yaml Outdated

internal/controller/idle_checker.go

    
              	}

              }

              // CheckWorkspaceIdle checks if a workspace is idle using the configured detection method

Contributor

JGuinegagne Oct 27, 2025

can you add to the comment what the bool, bool return type means?

Consider creating a struct for better readability/maintainability.

Contributor Author

aws-prayags Oct 28, 2025

added an IdleCheckResult struct

internal/controller/idle_checker.go Outdated

    
              		return false, true, fmt.Errorf("failed to find workspace pod: %w", err)

              	}

              	logger.V(1).Info("Found workspace pod", "pod", pod.Name)

Contributor

JGuinegagne Oct 27, 2025

sanity-check: are you sure you want to record this log? Your call, I'm just worried about the amount of logs in the Workspace reconciliation loop.

Contributor Author

aws-prayags Oct 28, 2025

fair point, removed

internal/controller/state_machine.go Outdated

    
              // NewStateMachine creates a new StateMachine

              -func NewStateMachine(resourceManager *ResourceManager, statusManager *StatusManager, templateResolver *TemplateResolver, recorder record.EventRecorder) *StateMachine {

              +func NewStateMachine(resourceManager *ResourceManager, statusManager *StatusManager, templateResolver *TemplateResolver, recorder record.EventRecorder, idleChecker *WorkspaceIdleChecker) *StateMachine {

Contributor

JGuinegagne Oct 27, 2025

optional: consider reformatting with 1-arg per line.
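
A sketch of the one-argument-per-line layout follows. To keep it self-contained, the manager types and `EventRecorder` are empty local stand-ins (the real code uses `record.EventRecorder` and the controller's own types), and the `StateMachine` field names are assumptions.

```go
package main

import "fmt"

// Local stand-ins so the sketch compiles on its own.
type (
	ResourceManager      struct{}
	StatusManager        struct{}
	TemplateResolver     struct{}
	WorkspaceIdleChecker struct{}
	// EventRecorder stands in for record.EventRecorder.
	EventRecorder interface {
		Event(reason, message string)
	}
)

// StateMachine field names are hypothetical.
type StateMachine struct {
	resourceManager  *ResourceManager
	statusManager    *StatusManager
	templateResolver *TemplateResolver
	recorder         EventRecorder
	idleChecker      *WorkspaceIdleChecker
}

// NewStateMachine creates a new StateMachine.
// One parameter per line keeps future additions to a one-line diff.
func NewStateMachine(
	resourceManager *ResourceManager,
	statusManager *StatusManager,
	templateResolver *TemplateResolver,
	recorder EventRecorder,
	idleChecker *WorkspaceIdleChecker,
) *StateMachine {
	return &StateMachine{
		resourceManager:  resourceManager,
		statusManager:    statusManager,
		templateResolver: templateResolver,
		recorder:         recorder,
		idleChecker:      idleChecker,
	}
}

func main() {
	sm := NewStateMachine(
		&ResourceManager{},
		&StatusManager{},
		&TemplateResolver{},
		nil, // no recorder needed for this sketch
		&WorkspaceIdleChecker{},
	)
	fmt.Println(sm.idleChecker != nil)
}
```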

internal/controller/state_machine.go Outdated

    
              	// If idle shutdown is not enabled, no requeue needed

              	if idleConfig == nil || !idleConfig.Enabled {

              		logger.V(1).Info("Idle shutdown not enabled")

Contributor

JGuinegagne Oct 27, 2025

debug?

Contributor Author

aws-prayags Oct 28, 2025 (edited)

the verbosity level is set to V(1) which is equivalent to debug
I can see them as DEBUG logs in controller logs

internal/controller/state_machine.go Outdated

    
              	logger.Info("Processing idle shutdown",

              		"timeout", idleConfig.TimeoutMinutes,

              		"resourceVersion", workspace.ResourceVersion)

Contributor

JGuinegagne Oct 27, 2025

combine logs please

internal/controller/state_machine.go Outdated

    
              	logger.Info("Updated workspace desired status to Stopped")

              	// Immediate requeue to start stopping process

              	return ctrl.Result{RequeueAfter: 0}, nil

Contributor

JGuinegagne Oct 27, 2025

consider adding a minimal wait here (few ms)
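
A sketch of what a small, explicit delay could look like. `Result` here is a local stand-in that mirrors only the `RequeueAfter` field of controller-runtime's `ctrl.Result`, and the 100 ms value is an arbitrary assumption. As far as I know, controller-runtime only schedules a delayed requeue when `RequeueAfter` is greater than zero, so a zero value may not requeue by itself.

```go
package main

import (
	"fmt"
	"time"
)

// Result is a local stand-in for ctrl.Result, reduced to the one
// field this sketch needs.
type Result struct {
	RequeueAfter time.Duration
}

// stopRequeueDelay is a hypothetical short wait before the reconciler
// picks the workspace up again to begin stopping it.
const stopRequeueDelay = 100 * time.Millisecond

// requeueAfterStopRequested returns a result that asks for a requeue
// after a minimal, non-zero delay.
func requeueAfterStopRequested() Result {
	return Result{RequeueAfter: stopRequeueDelay}
}

func main() {
	fmt.Println(requeueAfterStopRequested().RequeueAfter)
}
```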

internal/controller/state_machine.go Outdated

    
              }

              // areWorkspacePodsReady checks if workspace pods are ready for idle checking

              func (sm *StateMachine) areWorkspacePodsReady(ctx context.Context, workspace *workspacev1alpha1.Workspace) (bool, error) {

Contributor

JGuinegagne Oct 27, 2025

based on logic, rename: isAtLeastOneWorkspacePodReady()
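
The suggested name implies "any pod ready" semantics. A minimal sketch of that check, using a simplified hypothetical `Pod` type instead of the Kubernetes `corev1.Pod` and its conditions:

```go
package main

import "fmt"

// Pod is a simplified stand-in for a Kubernetes pod; the real check
// would inspect the pod's Ready condition.
type Pod struct {
	Name  string
	Ready bool
}

// isAtLeastOneWorkspacePodReady returns true as soon as any pod in the
// list is ready, and false for an empty list.
func isAtLeastOneWorkspacePodReady(pods []Pod) bool {
	for _, p := range pods {
		if p.Ready {
			return true
		}
	}
	return false
}

func main() {
	pods := []Pod{
		{Name: "workspace-0", Ready: false},
		{Name: "workspace-1", Ready: true},
	}
	fmt.Println(isAtLeastOneWorkspacePodReady(pods))
}
```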

Contributor

JGuinegagne commented Oct 27, 2025

High-level question:

if the idleShutdown probe permanently fails, should the Workspace be flagged as degraded? or perhaps immediately stopped based on some settings?

Contributor Author

aws-prayags commented Oct 27, 2025

High-level question:

if the idleShutdown probe permanently fails, should the Workspace be flagged as degraded? or perhaps immediately stopped based on some settings?

Yes, it should be. I'm tracking that in a separate issue: #84

aws-prayags added 3 commits

October 27, 2025 13:09


          updated api version in sample yamls

f9749d2


          Addressed PR feedback

583be5a


          refactor to use probe schema

320aff2

Contributor Author

aws-prayags commented Oct 28, 2025

Few high-level comments:

pipe through the command as exec attribute of the idleShutdownConfig (following Probe precedent)

look at probe interface, and evaluate if we should reuse the interface

if no reuse, then evaluate which attributes are relevant

I looked at the probe interface. exec is not applicable in our case, since its implementation only runs a command and checks for exit code 0/1; it has no way to use the returned value.
I'm reusing the probe's HTTPGet instead, including the entire HTTPGetAction object, which has the fields that we need.

Even though our implementation of the HTTPGet detector internally uses pods/exec, that's an internal implementation detail. The detection method is still an HTTP call to an endpoint, so I think it fits.

Labels

None yet