@ferdzo ferdzo commented Jan 20, 2026

This pull request introduces a major architectural change to the Function-as-a-Service (FaaS) platform, splitting auto-scaling responsibilities between the gateway (scale-up) and worker nodes (scale-down). The gateway now tracks active requests and pushes scale-up jobs, while workers independently manage scale-down and job consumption based on their own capacity. The documentation, core gateway, and worker logic have all been updated to reflect this new design. Additionally, a new script is added for updating the root filesystem, and system utility improvements are included.

Split Auto-Scaling Architecture

  • Documentation (documentation/DESIGN.md): The scaling design is rewritten to describe the new split architecture, detailing how the gateway handles scale-up and workers handle scale-down, including new diagrams and configuration tables. [1] [2]

Gateway: Scale-Up Logic

  • New gateway scaler (gateway/internal/scaler.go): Implements a background scaler that tracks active requests per function, calculates the target number of instances, and pushes provision jobs to Redis when more instances are needed.
  • Router integration (gateway/internal/router.go, gateway/main.go): The router now tracks requests for scaling, and the scaler is started as a background process in the gateway's main entrypoint. [1] [2] [3] [4]
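The scale-up math described above can be sketched as follows. This is a minimal illustration, not the actual scaler.go code; the function names and the per-instance concurrency knob are assumptions:

```go
package main

import "fmt"

// targetInstances computes how many instances a function needs, assuming
// each instance comfortably serves reqsPerInstance concurrent requests
// (a hypothetical tuning knob; the real scaler may use different math).
func targetInstances(activeRequests, reqsPerInstance int) int {
	if activeRequests == 0 {
		return 0
	}
	// Ceiling division: 7 active requests at 3 per instance -> 3 instances.
	return (activeRequests + reqsPerInstance - 1) / reqsPerInstance
}

// provisionJobsNeeded returns how many provision jobs the gateway would
// push to Redis given the current number of running instances. Scale-down
// is deliberately absent: in this design it belongs to the workers.
func provisionJobsNeeded(activeRequests, running, reqsPerInstance int) int {
	target := targetInstances(activeRequests, reqsPerInstance)
	if target <= running {
		return 0
	}
	return target - running
}

func main() {
	// 7 active requests, 2 running instances, 3 requests per instance:
	// target is 3, so push 1 provision job.
	fmt.Println(provisionJobsNeeded(7, 2, 3))
}
```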

Worker: Scale-Down and Capacity Management

  • New worker reaper (worker/internal/reaper.go): Adds a reaper that periodically checks for idle instances and implements a two-tier scale-down strategy (to 1 hot instance after 30s, to zero after 5m idle).
  • Instance activity tracking (worker/internal/instance.go, worker/internal/proxy.go): Each instance now updates its last request time on every proxied request, enabling accurate idle detection for scale-down. [1] [2]
  • Removes old scaling config struct from worker config.

System Utilities and Scripts

  • Adds a new script for updating the root filesystem image, which builds the latest aether-env binary and installs it along with an init script. (scripts/update-rootfs.sh)
  • Adds a utility function to get available RAM from /proc/meminfo for worker capacity checks. (shared/system/system.go)
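A helper like this typically parses the MemAvailable field. The sketch below takes the file contents as a string for testability; the real shared/system/system.go may differ in field choice or units:

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// parseMemAvailableMB extracts MemAvailable (converted to MB) from
// /proc/meminfo contents. /proc/meminfo reports values in kB, one
// "Field:   value kB" entry per line.
func parseMemAvailableMB(meminfo string) (int, error) {
	sc := bufio.NewScanner(strings.NewReader(meminfo))
	for sc.Scan() {
		line := sc.Text()
		if !strings.HasPrefix(line, "MemAvailable:") {
			continue
		}
		fields := strings.Fields(line) // ["MemAvailable:", "2048000", "kB"]
		if len(fields) < 2 {
			break
		}
		kb, err := strconv.Atoi(fields[1])
		if err != nil {
			return 0, err
		}
		return kb / 1024, nil
	}
	return 0, fmt.Errorf("MemAvailable not found in meminfo")
}

func main() {
	sample := "MemTotal: 32768000 kB\nMemFree: 1024000 kB\nMemAvailable: 2048000 kB\n"
	mb, err := parseMemAvailableMB(sample)
	fmt.Println(mb, err) // 2000 <nil>
}
```

In production the caller would read the string from /proc/meminfo with os.ReadFile; parsing is separated out here so the logic can be tested without a Linux host.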

Minor Cleanups

  • Removes unused code and comments, and makes minor improvements in init and config files. [1] [2] [3]

This refactor lays the foundation for more reliable, scalable, and decoupled function instance management, with clear separation of concerns between control plane and worker nodes.


Copilot AI left a comment


Pull request overview

This pull request implements a major architectural change to the auto-scaling system, splitting responsibilities between the gateway (scale-up) and workers (scale-down). The gateway now tracks active requests and pushes provision jobs when more capacity is needed, while workers independently manage instance lifecycle and self-regulate job consumption based on local capacity constraints.

Changes:

  • Gateway implements new scale-up logic that tracks active requests and pushes provision jobs to Redis
  • Workers implement new scale-down "reaper" that manages two-tier idle timeout strategy (30s to 1 instance, 5m to zero)
  • Workers add capacity gates to prevent overconsumption based on instance count and available RAM
  • Documentation updated to reflect the new split architecture with diagrams and configuration tables

Reviewed changes

Copilot reviewed 13 out of 14 changed files in this pull request and generated 11 comments.

Summary per file:

  • gateway/internal/scaler.go: New scaler that tracks active requests per function and pushes provision jobs when scale-up is needed
  • gateway/internal/router.go: Integration with scaler to track request start/completion
  • gateway/main.go: Initialization and startup of the scaler background loop
  • worker/internal/reaper.go: New reaper component implementing two-tier scale-down strategy
  • worker/internal/worker.go: Added capacity gate logic checking instance count and free RAM before consuming jobs
  • worker/internal/instance.go: Added Touch() and LastRequest() methods for tracking instance activity
  • worker/internal/proxy.go: Integration to update last request time on each proxied request
  • worker/internal/scaler.go: Removed old worker-side scaler implementation
  • worker/internal/config.go: Removed ScalingConfig struct (no longer needed)
  • shared/system/system.go: Added GetFreeRAM() utility for checking available memory
  • shared/protocol/messages.go: Formatting cleanup for struct field alignment
  • scripts/update-rootfs.sh: New script for updating root filesystem with aether-env binary
  • init/main.go: Minor comment cleanup
  • documentation/DESIGN.md: Updated architecture diagrams and scaling documentation



Copilot AI left a comment


Pull request overview

Copilot reviewed 15 out of 16 changed files in this pull request and generated 13 comments.



Comment on lines +79 to +83
    go func(fid, iid string) {
        if err := r.worker.StopInstance(fid, iid); err != nil {
            logger.Error("failed to stop instance", "function", fid, "instance", iid, "error", err)
        }
    }(functionID, inst.ID)

Copilot AI Jan 24, 2026


Potential deadlock: StopInstance is called in a goroutine while still holding the worker.mu lock. The goroutine will try to acquire worker.mu (line 330 in worker.go), but it's already held by the reaper's check method (line 42). This will cause a deadlock if the goroutine tries to execute before the lock is released.

The lock should be released before spawning the goroutine, or the StopInstance call should happen outside the locked section. Consider copying the instance ID and calling StopInstance after releasing the lock, or restructuring the code to avoid holding the lock when spawning goroutines that need it.
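The safe pattern the comment asks for can be sketched like this: collect victims while holding the mutex, then call the stop function only after releasing it. Type and method names here are stand-ins for the real worker/internal code:

```go
package main

import (
	"fmt"
	"sync"
)

// worker is a stand-in for the real worker type; its mutex guards the
// instance map, and the stop callback re-acquires that same mutex.
type worker struct {
	mu        sync.Mutex
	instances map[string][]string // functionID -> instance IDs
}

// stopIdle collects the instances to stop under the lock, then releases
// the lock BEFORE invoking stop, so stop can safely take w.mu itself.
func (w *worker) stopIdle(stop func(fid, iid string) error) {
	type victim struct{ fid, iid string }
	var victims []victim

	w.mu.Lock()
	for fid, iids := range w.instances {
		for _, iid := range iids {
			victims = append(victims, victim{fid, iid})
		}
	}
	w.mu.Unlock() // release first: calling stop while held would deadlock

	for _, v := range victims {
		if err := stop(v.fid, v.iid); err != nil {
			fmt.Println("failed to stop instance", v.fid, v.iid, err)
		}
	}
}

func main() {
	w := &worker{instances: map[string][]string{"fn-a": {"i-1"}}}
	w.stopIdle(func(fid, iid string) error {
		w.mu.Lock() // would deadlock if stopIdle still held the lock
		defer w.mu.Unlock()
		fmt.Println("stopped", fid, iid)
		return nil
	})
}
```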

Copilot uses AI. Check for mistakes.
Comment on lines +110 to +114
    go func(fid, iid string) {
        if err := r.worker.StopInstance(fid, iid); err != nil {
            logger.Error("failed to stop instance", "function", fid, "instance", iid, "error", err)
        }
    }(functionID, inst.ID)

Copilot AI Jan 24, 2026


Potential deadlock: StopInstance is called in a goroutine while still holding the worker.mu lock. The goroutine will try to acquire worker.mu (line 330 in worker.go), but it's already held by the reaper's check method (line 42). This will cause a deadlock if the goroutine tries to execute before the lock is released.

The lock should be released before spawning the goroutine, or the StopInstance call should happen outside the locked section. Consider copying the instance ID and calling StopInstance after releasing the lock, or restructuring the code to avoid holding the lock when spawning goroutines that need it.

func (p *Proxy) ServeHTTP(w http.ResponseWriter, r *http.Request) {
    p.instance.IncrementActiveRequests()
    defer p.instance.DecrementActiveRequests()
    p.instance.Touch() // Update last request time

Copilot AI Jan 24, 2026


Redundant operation: IncrementActiveRequests already updates lastRequestTime (lines 69-71 in instance.go), so calling Touch here is redundant. This creates unnecessary lock contention by acquiring the instance.mu lock twice per request instead of once.

Consider removing this Touch call since IncrementActiveRequests already handles the timestamp update, or remove the timestamp update from IncrementActiveRequests if Touch is meant to be the only way to update it.

Suggested change
-    p.instance.Touch() // Update last request time

Comment on lines +106 to +112
    w.mu.Unlock()

    if totalInstances >= maxInstances {
        logger.Info("at capacity, waiting", "current", totalInstances, "max", maxInstances)
        time.Sleep(500 * time.Millisecond)
        continue
    }

Copilot AI Jan 24, 2026


Race condition: The capacity check releases the worker.mu lock at line 106, then performs the capacity check at line 108. Between releasing the lock and checking totalInstances, other goroutines could spawn new instances, potentially exceeding the maxInstances limit.

This is a Time-of-Check-Time-of-Use (TOCTOU) vulnerability. Multiple workers could simultaneously pass the capacity check when at 9 instances, then all spawn new instances, exceeding the limit. Consider checking capacity again after acquiring a job from the queue, or using atomic operations to reserve capacity before spawning.

Suggested change
-    w.mu.Unlock()
-
-    if totalInstances >= maxInstances {
-        logger.Info("at capacity, waiting", "current", totalInstances, "max", maxInstances)
-        time.Sleep(500 * time.Millisecond)
-        continue
-    }
+    if totalInstances >= maxInstances {
+        w.mu.Unlock()
+        logger.Info("at capacity, waiting", "current", totalInstances, "max", maxInstances)
+        time.Sleep(500 * time.Millisecond)
+        continue
+    }
+    w.mu.Unlock()

echo ""
echo "=== Installing aether-env ==="
cp aether-env "$MOUNT_POINT/usr/bin/aether-env"

Copilot AI Jan 24, 2026


Path reference issue: After changing directory to PROJECT_ROOT/init at line 22, the script attempts to copy 'aether-env' at line 43 using a relative path. However, the current directory has changed, so 'aether-env' refers to the file in the init directory, not the original AETHER_ENV_BIN path.

While this might work if we're still in the init directory, the script should use the absolute path AETHER_ENV_BIN for clarity and reliability. Consider changing line 43 to use the full path: cp "$AETHER_ENV_BIN" "$MOUNT_POINT/usr/bin/aether-env"

Suggested change
-cp aether-env "$MOUNT_POINT/usr/bin/aether-env"
+cp "$AETHER_ENV_BIN" "$MOUNT_POINT/usr/bin/aether-env"

Comment on lines +370 to +372
- Each worker checks capacity independently (no coordination)
- Reaper runs on all workers, but only stops 1 instance per cycle
- Gateway is unaware of worker count (fully decoupled)

Copilot AI Jan 24, 2026


The documentation states that workers check capacity "independently (no coordination)" and that "each worker checks capacity independently", but the actual capacity check has a TOCTOU race condition. Between checking capacity (line 101-106) and spawning instances (line 188 in handleJob), multiple workers could exceed the limit. This discrepancy between the documented behavior and actual implementation could mislead users about the system's reliability.

Consider either: (1) updating the documentation to mention this race condition and its implications, or (2) implementing proper coordination to ensure the documented behavior is accurate.

Suggested change
-  - Each worker checks capacity independently (no coordination)
-  - Reaper runs on all workers, but only stops 1 instance per cycle
-  - Gateway is unaware of worker count (fully decoupled)
+  - Each worker checks capacity independently (no coordination); this is a best-effort check and is subject to a TOCTOU race with instance spawning
+  - In periods of high concurrency, multiple workers may all pass the capacity check and briefly exceed `MaxInstances` in aggregate; strict enforcement would require additional coordination
+  - Reaper runs on all workers, but only stops 1 instance per cycle
+  - Gateway is unaware of worker count (fully decoupled), so it does not enforce a global capacity limit across workers


logger.Info("watching queue", "queue", protocol.QueueVMProvision)

maxInstances := 10

Copilot AI Jan 24, 2026


Magic number: The maxInstances value is hardcoded to 10 (line 94), but this should be configurable or at least defined as a constant. The documentation (line 364) mentions "MaxInstances | 10 | Per-worker capacity limit", but having it hardcoded makes it difficult to adjust per deployment or environment.

Consider defining this as a configuration parameter or at minimum as a package-level constant with a clear name like DefaultMaxInstances or MaxInstancesPerWorker.

Comment on lines +110 to +118
        time.Sleep(500 * time.Millisecond)
        continue
    }

    freeRAM, err := system.GetFreeRAM()
    if err == nil && freeRAM < 500 {
        logger.Info("low memory, waiting", "free_mb", freeRAM)
        time.Sleep(500 * time.Millisecond)
        continue

Copilot AI Jan 24, 2026


Inefficient polling: When at capacity or low memory, the worker sleeps for 500ms before checking again (lines 110, 117). With a 1-second BLPop timeout (line 121), this creates a cycle time of ~1.5 seconds when capacity-constrained. This polling approach is less efficient than event-driven alternatives.

Consider either: (1) using a longer sleep duration when at capacity to reduce CPU overhead, (2) implementing an exponential backoff, or (3) using a channel-based notification system when capacity frees up. However, this is acceptable for the current design and is not critical.
