@ferdzo ferdzo commented Jan 20, 2026

This pull request introduces a major architectural change to the Function-as-a-Service (FaaS) platform, splitting auto-scaling responsibilities between the gateway (scale-up) and worker nodes (scale-down). The gateway now tracks active requests and pushes scale-up jobs, while workers independently manage scale-down and job consumption based on their own capacity. The documentation, core gateway, and worker logic have all been updated to reflect this new design. Additionally, a new script is added for updating the root filesystem, and system utility improvements are included.

Split Auto-Scaling Architecture

  • Documentation (documentation/DESIGN.md): The scaling design is rewritten to describe the new split architecture, detailing how the gateway handles scale-up and workers handle scale-down, including new diagrams and configuration tables. [1] [2]

Gateway: Scale-Up Logic

  • New gateway scaler (gateway/internal/scaler.go): Implements a background scaler that tracks active requests per function, calculates the target number of instances, and pushes provision jobs to Redis when more instances are needed.
  • Router integration (gateway/internal/router.go, gateway/main.go): The router now tracks requests for scaling, and the scaler is started as a background process in the gateway's main entrypoint. [1] [2] [3] [4]
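The scale-up math described above can be sketched as follows. This is a minimal illustration, not the actual scaler.go code; the function names and the per-instance concurrency knob are assumptions:

```go
package main

import "fmt"

// targetInstances computes how many instances a function needs, assuming
// each instance comfortably serves reqsPerInstance concurrent requests
// (a hypothetical tuning knob; the real scaler may use different math).
func targetInstances(activeRequests, reqsPerInstance int) int {
	if activeRequests == 0 {
		return 0
	}
	// Ceiling division: 7 active requests at 3 per instance -> 3 instances.
	return (activeRequests + reqsPerInstance - 1) / reqsPerInstance
}

// provisionJobsNeeded returns how many provision jobs the gateway would
// push to Redis given the current number of running instances. Scale-down
// is deliberately absent: in this design it belongs to the workers.
func provisionJobsNeeded(activeRequests, running, reqsPerInstance int) int {
	target := targetInstances(activeRequests, reqsPerInstance)
	if target <= running {
		return 0
	}
	return target - running
}

func main() {
	// 7 active requests, 2 running instances, 3 requests per instance:
	// target is 3, so push 1 provision job.
	fmt.Println(provisionJobsNeeded(7, 2, 3))
}
```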

Worker: Scale-Down and Capacity Management

  • New worker reaper (worker/internal/reaper.go): Adds a reaper that periodically checks for idle instances and implements a two-tier scale-down strategy (to 1 hot instance after 30s, to zero after 5m idle).
  • Instance activity tracking (worker/internal/instance.go, worker/internal/proxy.go): Each instance now updates its last request time on every proxied request, enabling accurate idle detection for scale-down. [1] [2]
  • Removes old scaling config struct from worker config.

System Utilities and Scripts

  • Adds a new script for updating the root filesystem image, which builds the latest aether-env binary and installs it along with an init script. (scripts/update-rootfs.sh)
  • Adds a utility function to get available RAM from /proc/meminfo for worker capacity checks. (shared/system/system.go)
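A helper like this typically parses the MemAvailable field. The sketch below takes the file contents as a string for testability; the real shared/system/system.go may differ in field choice or units:

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// parseMemAvailableMB extracts MemAvailable (converted to MB) from
// /proc/meminfo contents. /proc/meminfo reports values in kB, one
// "Field:   value kB" entry per line.
func parseMemAvailableMB(meminfo string) (int, error) {
	sc := bufio.NewScanner(strings.NewReader(meminfo))
	for sc.Scan() {
		line := sc.Text()
		if !strings.HasPrefix(line, "MemAvailable:") {
			continue
		}
		fields := strings.Fields(line) // ["MemAvailable:", "2048000", "kB"]
		if len(fields) < 2 {
			break
		}
		kb, err := strconv.Atoi(fields[1])
		if err != nil {
			return 0, err
		}
		return kb / 1024, nil
	}
	return 0, fmt.Errorf("MemAvailable not found in meminfo")
}

func main() {
	sample := "MemTotal: 32768000 kB\nMemFree: 1024000 kB\nMemAvailable: 2048000 kB\n"
	mb, err := parseMemAvailableMB(sample)
	fmt.Println(mb, err) // 2000 <nil>
}
```

In production the caller would read the string from /proc/meminfo with os.ReadFile; parsing is separated out here so the logic can be tested without a Linux host.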

Minor Cleanups

  • Removes unused code and comments, and makes minor improvements in init and config files. [1] [2] [3]

This refactor lays the foundation for more reliable, scalable, and decoupled function instance management, with clear separation of concerns between control plane and worker nodes.


Copilot AI left a comment


Pull request overview

This pull request implements a major architectural change to the auto-scaling system, splitting responsibilities between the gateway (scale-up) and workers (scale-down). The gateway now tracks active requests and pushes provision jobs when more capacity is needed, while workers independently manage instance lifecycle and self-regulate job consumption based on local capacity constraints.

Changes:

  • Gateway implements new scale-up logic that tracks active requests and pushes provision jobs to Redis
  • Workers implement new scale-down "reaper" that manages two-tier idle timeout strategy (30s to 1 instance, 5m to zero)
  • Workers add capacity gates to prevent overconsumption based on instance count and available RAM
  • Documentation updated to reflect the new split architecture with diagrams and configuration tables

Reviewed changes

Copilot reviewed 13 out of 14 changed files in this pull request and generated 11 comments.

Summary per file:

  • gateway/internal/scaler.go: New scaler that tracks active requests per function and pushes provision jobs when scale-up is needed
  • gateway/internal/router.go: Integration with scaler to track request start/completion
  • gateway/main.go: Initialization and startup of the scaler background loop
  • worker/internal/reaper.go: New reaper component implementing two-tier scale-down strategy
  • worker/internal/worker.go: Added capacity gate logic checking instance count and free RAM before consuming jobs
  • worker/internal/instance.go: Added Touch() and LastRequest() methods for tracking instance activity
  • worker/internal/proxy.go: Integration to update last request time on each proxied request
  • worker/internal/scaler.go: Removed old worker-side scaler implementation
  • worker/internal/config.go: Removed ScalingConfig struct (no longer needed)
  • shared/system/system.go: Added GetFreeRAM() utility for checking available memory
  • shared/protocol/messages.go: Formatting cleanup for struct field alignment
  • scripts/update-rootfs.sh: New script for updating root filesystem with aether-env binary
  • init/main.go: Minor comment cleanup
  • documentation/DESIGN.md: Updated architecture diagrams and scaling documentation



Copilot AI left a comment


Pull request overview

Copilot reviewed 15 out of 16 changed files in this pull request and generated 13 comments.



Comment on lines +79 to +83
    go func(fid, iid string) {
        if err := r.worker.StopInstance(fid, iid); err != nil {
            logger.Error("failed to stop instance", "function", fid, "instance", iid, "error", err)
        }
    }(functionID, inst.ID)

Copilot AI Jan 24, 2026


Potential deadlock: StopInstance is called in a goroutine while still holding the worker.mu lock. The goroutine will try to acquire worker.mu (line 330 in worker.go), but it's already held by the reaper's check method (line 42). This will cause a deadlock if the goroutine tries to execute before the lock is released.

The lock should be released before spawning the goroutine, or the StopInstance call should happen outside the locked section. Consider copying the instance ID and calling StopInstance after releasing the lock, or restructuring the code to avoid holding the lock when spawning goroutines that need it.
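The safe pattern the comment asks for can be sketched like this: collect victims while holding the mutex, then call the stop function only after releasing it. Type and method names here are stand-ins for the real worker/internal code:

```go
package main

import (
	"fmt"
	"sync"
)

// worker is a stand-in for the real worker type; its mutex guards the
// instance map, and the stop callback re-acquires that same mutex.
type worker struct {
	mu        sync.Mutex
	instances map[string][]string // functionID -> instance IDs
}

// stopIdle collects the instances to stop under the lock, then releases
// the lock BEFORE invoking stop, so stop can safely take w.mu itself.
func (w *worker) stopIdle(stop func(fid, iid string) error) {
	type victim struct{ fid, iid string }
	var victims []victim

	w.mu.Lock()
	for fid, iids := range w.instances {
		for _, iid := range iids {
			victims = append(victims, victim{fid, iid})
		}
	}
	w.mu.Unlock() // release first: calling stop while held would deadlock

	for _, v := range victims {
		if err := stop(v.fid, v.iid); err != nil {
			fmt.Println("failed to stop instance", v.fid, v.iid, err)
		}
	}
}

func main() {
	w := &worker{instances: map[string][]string{"fn-a": {"i-1"}}}
	w.stopIdle(func(fid, iid string) error {
		w.mu.Lock() // would deadlock if stopIdle still held the lock
		defer w.mu.Unlock()
		fmt.Println("stopped", fid, iid)
		return nil
	})
}
```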

Copilot uses AI. Check for mistakes.
Comment on lines +110 to +114
    go func(fid, iid string) {
        if err := r.worker.StopInstance(fid, iid); err != nil {
            logger.Error("failed to stop instance", "function", fid, "instance", iid, "error", err)
        }
    }(functionID, inst.ID)

Copilot AI Jan 24, 2026


Potential deadlock: StopInstance is called in a goroutine while still holding the worker.mu lock. The goroutine will try to acquire worker.mu (line 330 in worker.go), but it's already held by the reaper's check method (line 42). This will cause a deadlock if the goroutine tries to execute before the lock is released.

The lock should be released before spawning the goroutine, or the StopInstance call should happen outside the locked section. Consider copying the instance ID and calling StopInstance after releasing the lock, or restructuring the code to avoid holding the lock when spawning goroutines that need it.

func (p *Proxy) ServeHTTP(w http.ResponseWriter, r *http.Request) {
    p.instance.IncrementActiveRequests()
    defer p.instance.DecrementActiveRequests()
    p.instance.Touch() // Update last request time

Copilot AI Jan 24, 2026


Redundant operation: IncrementActiveRequests already updates lastRequestTime (lines 69-71 in instance.go), so calling Touch here is redundant. This creates unnecessary lock contention by acquiring the instance.mu lock twice per request instead of once.

Consider removing this Touch call since IncrementActiveRequests already handles the timestamp update, or remove the timestamp update from IncrementActiveRequests if Touch is meant to be the only way to update it.

Suggested change
-    p.instance.Touch() // Update last request time

Comment on lines +106 to +112
    w.mu.Unlock()

    if totalInstances >= maxInstances {
        logger.Info("at capacity, waiting", "current", totalInstances, "max", maxInstances)
        time.Sleep(500 * time.Millisecond)
        continue
    }

Copilot AI Jan 24, 2026


Race condition: The capacity check releases the worker.mu lock at line 106, then performs the capacity check at line 108. Between releasing the lock and checking totalInstances, other goroutines could spawn new instances, potentially exceeding the maxInstances limit.

This is a Time-of-Check-Time-of-Use (TOCTOU) vulnerability. Multiple workers could simultaneously pass the capacity check when at 9 instances, then all spawn new instances, exceeding the limit. Consider checking capacity again after acquiring a job from the queue, or using atomic operations to reserve capacity before spawning.

Suggested change
-    w.mu.Unlock()
-
-    if totalInstances >= maxInstances {
-        logger.Info("at capacity, waiting", "current", totalInstances, "max", maxInstances)
-        time.Sleep(500 * time.Millisecond)
-        continue
-    }
+    if totalInstances >= maxInstances {
+        w.mu.Unlock()
+        logger.Info("at capacity, waiting", "current", totalInstances, "max", maxInstances)
+        time.Sleep(500 * time.Millisecond)
+        continue
+    }
+    w.mu.Unlock()

echo ""
echo "=== Installing aether-env ==="
cp aether-env "$MOUNT_POINT/usr/bin/aether-env"

Copilot AI Jan 24, 2026


Path reference issue: After changing directory to PROJECT_ROOT/init at line 22, the script attempts to copy 'aether-env' at line 43 using a relative path. However, the current directory has changed, so 'aether-env' refers to the file in the init directory, not the original AETHER_ENV_BIN path.

While this might work if we're still in the init directory, the script should use the absolute path AETHER_ENV_BIN for clarity and reliability. Consider changing line 43 to use the full path: cp "$AETHER_ENV_BIN" "$MOUNT_POINT/usr/bin/aether-env"

Suggested change
-cp aether-env "$MOUNT_POINT/usr/bin/aether-env"
+cp "$AETHER_ENV_BIN" "$MOUNT_POINT/usr/bin/aether-env"

Comment on lines +370 to +372
- Each worker checks capacity independently (no coordination)
- Reaper runs on all workers, but only stops 1 instance per cycle
- Gateway is unaware of worker count (fully decoupled)

Copilot AI Jan 24, 2026


The documentation states that workers check capacity "independently (no coordination)" and that "each worker checks capacity independently", but the actual capacity check has a TOCTOU race condition. Between checking capacity (line 101-106) and spawning instances (line 188 in handleJob), multiple workers could exceed the limit. This discrepancy between the documented behavior and actual implementation could mislead users about the system's reliability.

Consider either: (1) updating the documentation to mention this race condition and its implications, or (2) implementing proper coordination to ensure the documented behavior is accurate.

Suggested change
-  - Each worker checks capacity independently (no coordination)
-  - Reaper runs on all workers, but only stops 1 instance per cycle
-  - Gateway is unaware of worker count (fully decoupled)
+  - Each worker checks capacity independently (no coordination); this is a best-effort check and is subject to a TOCTOU race with instance spawning
+  - In periods of high concurrency, multiple workers may all pass the capacity check and briefly exceed `MaxInstances` in aggregate; strict enforcement would require additional coordination
+  - Reaper runs on all workers, but only stops 1 instance per cycle
+  - Gateway is unaware of worker count (fully decoupled), so it does not enforce a global capacity limit across workers


logger.Info("watching queue", "queue", protocol.QueueVMProvision)

maxInstances := 10

Copilot AI Jan 24, 2026


Magic number: The maxInstances value is hardcoded to 10 (line 94), but this should be configurable or at least defined as a constant. The documentation (line 364) mentions "MaxInstances | 10 | Per-worker capacity limit", but having it hardcoded makes it difficult to adjust per deployment or environment.

Consider defining this as a configuration parameter or at minimum as a package-level constant with a clear name like DefaultMaxInstances or MaxInstancesPerWorker.

Comment on lines +110 to +118
        time.Sleep(500 * time.Millisecond)
        continue
    }

    freeRAM, err := system.GetFreeRAM()
    if err == nil && freeRAM < 500 {
        logger.Info("low memory, waiting", "free_mb", freeRAM)
        time.Sleep(500 * time.Millisecond)
        continue

Copilot AI Jan 24, 2026


Inefficient polling: When at capacity or low memory, the worker sleeps for 500ms before checking again (lines 110, 117). With a 1-second BLPop timeout (line 121), this creates a cycle time of ~1.5 seconds when capacity-constrained. This polling approach is less efficient than event-driven alternatives.

Consider either: (1) using a longer sleep duration when at capacity to reduce CPU overhead, (2) implementing an exponential backoff, or (3) using a channel-based notification system when capacity frees up. However, this is acceptable for the current design and is not critical.
