Skip to content

Conversation

@aws-prayags
Copy link
Contributor

Add Idle Shutdown Support for Workspace and WorkspaceTemplates

Templates can define default idle shutdown settings settings that workspaces inherit or override within configurable bounds.
Workspaces automatically stop after the configured idle period.
Templates can enforce policies by disabling overrides or setting timeout limits.

Future items for idle-shutdown - #84

Other changes

  • Fixed validation loop where template failures retried indefinitely instead of stopping reconciliation
  • (Temp - to check with team) Removed XValidation rule that failed on legitimate finalizer updates

Testing

  • Tested template inheritance, workspace overrides, and policy enforcement
  • Verified idle detection works for both JupyterLab and CodeEditor applications in sagemaker-distribution images
  • (Tests yet to be added, will be added in this PR)

@aws-prayags aws-prayags force-pushed the feat/idle-shutdown1 branch 5 times, most recently from 14d665b to a875077 Compare October 27, 2025 18:40
@aws-prayags aws-prayags marked this pull request as ready for review October 27, 2025 18:47
@JGuinegagne JGuinegagne changed the title feat: add support for idle-shutdown of workspaces feat: idle-shutdown of workspaces Oct 27, 2025
Copy link
Contributor

@JGuinegagne JGuinegagne left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Few high-level comments:

  • pipe through the command as exec attribute of the idleShutdownConfig (following Probe precedent
  • look at probe interface, and evaluate if we should reuse the interface
  • if no reuse, then evaluate which attributes are relevant

// EndpointCheck specifies HTTP endpoint to check for idle status
// +optional
EndpointCheck *EndpointCheckSpec `json:"endpointCheck,omitempty"`
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

non-blocking: should there be a HealthCheckIntervalMinutes attribute or similar?

Consider following the precedent of ReadinessProbe:

kubectl explain pod.spec.containers.readinessProbe

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did consider this - my preference is to start with a reasonable default ( in this PR it's 5 mins ) and add in support for HealthCheckIntervalMinutes in future if needed.

}
}

// CheckWorkspaceIdle checks if a workspace is idle using the configured detection method
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add to the comment what the bool, bool return type means?

Consider creating a struct for better readability/maintainability.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added an IdleCheckResult struct

return false, true, fmt.Errorf("failed to find workspace pod: %w", err)
}

logger.V(1).Info("Found workspace pod", "pod", pod.Name)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sanity-check: are you sure you want to record this log? Your call, I'm just worried about the amount of logs in the Workspace reconciliation loop.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fair point, removed


// NewStateMachine creates a new StateMachine
func NewStateMachine(resourceManager *ResourceManager, statusManager *StatusManager, templateResolver *TemplateResolver, recorder record.EventRecorder) *StateMachine {
func NewStateMachine(resourceManager *ResourceManager, statusManager *StatusManager, templateResolver *TemplateResolver, recorder record.EventRecorder, idleChecker *WorkspaceIdleChecker) *StateMachine {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

optional: consider reformating with 1-arg per line.


// If idle shutdown is not enabled, no requeue needed
if idleConfig == nil || !idleConfig.Enabled {
logger.V(1).Info("Idle shutdown not enabled")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

debug?

Copy link
Contributor Author

@aws-prayags aws-prayags Oct 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the verbosity level is set to V(1) which is equivalent to debug
I can see them as DEBUG logs in controller logs


logger.Info("Processing idle shutdown",
"timeout", idleConfig.TimeoutMinutes,
"resourceVersion", workspace.ResourceVersion)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

combine logs please

logger.Info("Updated workspace desired status to Stopped")

// Immediate requeue to start stopping process
return ctrl.Result{RequeueAfter: 0}, nil
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

consider adding a minimal wait here (few ms)

}

// areWorkspacePodsReady checks if workspace pods are ready for idle checking
func (sm *StateMachine) areWorkspacePodsReady(ctx context.Context, workspace *workspacev1alpha1.Workspace) (bool, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

based on logic, rename: isAtLeastOneWorkspacePodReady()

@JGuinegagne
Copy link
Contributor

High-level question:

  • if the idleShutdown probe permanently fails, should the Workspace be flagged as degraded? or perhaps immediately stopped based on some settings?

@aws-prayags
Copy link
Contributor Author

High-level question:

  • if the idleShutdown probe permanently fails, should the Workspace be flagged as degraded? or perhaps immediately stopped based on some settings?

yes, it should be - I'm tracking that in separate issue - #84

@aws-prayags
Copy link
Contributor Author

Few high-level comments:

  • pipe through the command as exec attribute of the idleShutdownConfig (following Probe precedent
  • look at probe interface, and evaluate if we should reuse the interface
  • if no reuse, then evaluate which attributes are relevant

I looked at the probe interface, exec is not applicable for our case since it's implementation only allows running a command and looks for exit code 0/1 and does not have ability to use returned value
I'm reusing probe's HTTPGet instead - and reusing the entire HTTPGetAction object - that includes the fields that we need.

Even though our implementation of HTTPGet detector internally uses pods/exec - that's an internal implementation detail. The detection method is still to make an http call at an endpoint - so I think it fits

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants