Scaler refactor #3
base: develop
Conversation
Pull request overview
This pull request implements a major architectural change to the auto-scaling system, splitting responsibilities between the gateway (scale-up) and workers (scale-down). The gateway now tracks active requests and pushes provision jobs when more capacity is needed, while workers independently manage instance lifecycle and self-regulate job consumption based on local capacity constraints.
Changes:
- Gateway implements new scale-up logic that tracks active requests and pushes provision jobs to Redis
- Workers implement a new scale-down "reaper" that applies a two-tier idle timeout strategy (30s down to 1 instance, 5m down to zero); see the sketch after this list
- Workers add capacity gates to prevent overconsumption based on instance count and available RAM
- Documentation updated to reflect the new split architecture with diagrams and configuration tables
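To make the reaper item above concrete, here is a minimal sketch of what a two-tier idle-timeout decision can look like. Everything in it except the 30s/5m thresholds and the `LastRequest()` accessor (both mentioned in this PR) is an illustrative placeholder, not the PR's actual code:

```go
package reaper

import "time"

// Instance is a stand-in for the worker's instance type; LastRequest mirrors
// the accessor added in this PR.
type Instance struct{ lastRequest time.Time }

func (i *Instance) LastRequest() time.Time { return i.lastRequest }

const (
    warmIdleTimeout = 30 * time.Second // tier 1: shrink toward one warm instance
    zeroIdleTimeout = 5 * time.Minute  // tier 2: scale the last instance to zero
)

// idleCandidate returns at most one instance to stop this cycle, matching the
// "only stops 1 instance per cycle" behaviour noted later in this review.
func idleCandidate(instances []*Instance, now time.Time) *Instance {
    for _, inst := range instances {
        idle := now.Sub(inst.LastRequest())
        if len(instances) > 1 && idle > warmIdleTimeout {
            return inst
        }
        if len(instances) == 1 && idle > zeroIdleTimeout {
            return inst
        }
    }
    return nil
}
```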
Reviewed changes
Copilot reviewed 13 out of 14 changed files in this pull request and generated 11 comments.
| File | Description |
|---|---|
| gateway/internal/scaler.go | New scaler that tracks active requests per function and pushes provision jobs when scale-up is needed |
| gateway/internal/router.go | Integration with scaler to track request start/completion |
| gateway/main.go | Initialization and startup of the scaler background loop |
| worker/internal/reaper.go | New reaper component implementing two-tier scale-down strategy |
| worker/internal/worker.go | Added capacity gate logic checking instance count and free RAM before consuming jobs |
| worker/internal/instance.go | Added Touch() and LastRequest() methods for tracking instance activity |
| worker/internal/proxy.go | Integration to update last request time on each proxied request |
| worker/internal/scaler.go | Removed old worker-side scaler implementation |
| worker/internal/config.go | Removed ScalingConfig struct (no longer needed) |
| shared/system/system.go | Added GetFreeRAM() utility for checking available memory |
| shared/protocol/messages.go | Formatting cleanup for struct field alignment |
| scripts/update-rootfs.sh | New script for updating root filesystem with aether-env binary |
| init/main.go | Minor comment cleanup |
| documentation/DESIGN.md | Updated architecture diagrams and scaling documentation |
Pull request overview
Copilot reviewed 15 out of 16 changed files in this pull request and generated 13 comments.
```go
go func(fid, iid string) {
    if err := r.worker.StopInstance(fid, iid); err != nil {
        logger.Error("failed to stop instance", "function", fid, "instance", iid, "error", err)
    }
}(functionID, inst.ID)
```
Copilot AI · Jan 24, 2026
Potential deadlock: StopInstance is called in a goroutine while still holding the worker.mu lock. The goroutine will try to acquire worker.mu (line 330 in worker.go), but it's already held by the reaper's check method (line 42). This will cause a deadlock if the goroutine tries to execute before the lock is released.
The lock should be released before spawning the goroutine, or the StopInstance call should happen outside the locked section. Consider copying the instance ID and calling StopInstance after releasing the lock, or restructuring the code to avoid holding the lock when spawning goroutines that need it.
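One way to apply that restructuring, sketched with stub types (only StopInstance and the log message come from the diff; the other names, fields, and the shouldStop helper are placeholders):

```go
package reaper

import (
    "log/slog"
    "sync"
)

// Stub stand-ins for the worker-side types referenced in the diff.
type instance struct{ ID string }

type worker struct {
    mu        sync.Mutex
    instances map[string][]*instance
}

// StopInstance acquires w.mu internally in the real code.
func (w *worker) StopInstance(functionID, instanceID string) error { return nil }

type Reaper struct{ worker *worker }

func (r *Reaper) shouldStop(*instance) bool { return false }

// check collects stop candidates while holding worker.mu, then releases the
// lock before calling StopInstance, so the call can re-acquire it safely.
func (r *Reaper) check(functionID string) {
    r.worker.mu.Lock()
    var victims []string
    for _, inst := range r.worker.instances[functionID] {
        if r.shouldStop(inst) {
            victims = append(victims, inst.ID)
        }
    }
    r.worker.mu.Unlock() // released before StopInstance needs it

    for _, id := range victims {
        if err := r.worker.StopInstance(functionID, id); err != nil {
            slog.Error("failed to stop instance", "function", functionID, "instance", id, "error", err)
        }
    }
}
```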
```go
func (p *Proxy) ServeHTTP(w http.ResponseWriter, r *http.Request) {
    p.instance.IncrementActiveRequests()
    defer p.instance.DecrementActiveRequests()
    p.instance.Touch() // Update last request time
```
Copilot AI · Jan 24, 2026
Redundant operation: IncrementActiveRequests already updates lastRequestTime (lines 69-71 in instance.go), so calling Touch here is redundant. This creates unnecessary lock contention by acquiring the instance.mu lock twice per request instead of once.
Consider removing this Touch call since IncrementActiveRequests already handles the timestamp update, or remove the timestamp update from IncrementActiveRequests if Touch is meant to be the only way to update it.
Suggested change (remove the redundant call):

```go
p.instance.Touch() // Update last request time
```
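For reference, a minimal sketch of the overlap described above, assuming field names that may not match the real instance.go:

```go
package instance

import (
    "sync"
    "time"
)

type Instance struct {
    mu             sync.Mutex
    activeRequests int
    lastRequest    time.Time
}

// IncrementActiveRequests already refreshes the idle timestamp, which is what
// makes an extra Touch() call on the same request redundant.
func (i *Instance) IncrementActiveRequests() {
    i.mu.Lock()
    defer i.mu.Unlock()
    i.activeRequests++
    i.lastRequest = time.Now()
}

// Touch refreshes only the idle timestamp.
func (i *Instance) Touch() {
    i.mu.Lock()
    defer i.mu.Unlock()
    i.lastRequest = time.Now()
}
```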
```go
w.mu.Unlock()

if totalInstances >= maxInstances {
    logger.Info("at capacity, waiting", "current", totalInstances, "max", maxInstances)
    time.Sleep(500 * time.Millisecond)
    continue
}
```
Copilot AI · Jan 24, 2026
Race condition: The capacity check releases the worker.mu lock at line 106, then performs the capacity check at line 108. Between releasing the lock and checking totalInstances, other goroutines could spawn new instances, potentially exceeding the maxInstances limit.
This is a Time-of-Check-Time-of-Use (TOCTOU) vulnerability. Multiple workers could simultaneously pass the capacity check when at 9 instances, then all spawn new instances, exceeding the limit. Consider checking capacity again after acquiring a job from the queue, or using atomic operations to reserve capacity before spawning.
Suggested change (hold the lock through the capacity check):

```go
if totalInstances >= maxInstances {
    w.mu.Unlock()
    logger.Info("at capacity, waiting", "current", totalInstances, "max", maxInstances)
    time.Sleep(500 * time.Millisecond)
    continue
}
w.mu.Unlock()
```
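The comment's alternative, reserving capacity atomically before consuming a job, could look roughly like this (a sketch; the capacityGate type and its names are placeholders, not part of this PR):

```go
package worker

import "sync/atomic"

// capacityGate reserves instance slots before a job is consumed, so the check
// and the spawn decision cannot race across goroutines.
type capacityGate struct {
    reserved int64
    max      int64
}

// tryReserve atomically claims one slot; callers must call release if the
// instance fails to start (or once it is later stopped).
func (g *capacityGate) tryReserve() bool {
    for {
        cur := atomic.LoadInt64(&g.reserved)
        if cur >= g.max {
            return false // at capacity: skip consuming a job this iteration
        }
        if atomic.CompareAndSwapInt64(&g.reserved, cur, cur+1) {
            return true
        }
    }
}

func (g *capacityGate) release() {
    atomic.AddInt64(&g.reserved, -1)
}
```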
```sh
echo ""
echo "=== Installing aether-env ==="
cp aether-env "$MOUNT_POINT/usr/bin/aether-env"
```
Copilot AI · Jan 24, 2026
Path reference issue: After changing directory to PROJECT_ROOT/init at line 22, the script attempts to copy 'aether-env' at line 43 using a relative path. However, the current directory has changed, so 'aether-env' refers to the file in the init directory, not the original AETHER_ENV_BIN path.
While this might work if we're still in the init directory, the script should use the absolute path AETHER_ENV_BIN for clarity and reliability. Consider changing line 43 to use the full path: cp "$AETHER_ENV_BIN" "$MOUNT_POINT/usr/bin/aether-env"
Suggested change:

```sh
cp "$AETHER_ENV_BIN" "$MOUNT_POINT/usr/bin/aether-env"
```
```md
- Each worker checks capacity independently (no coordination)
- Reaper runs on all workers, but only stops 1 instance per cycle
- Gateway is unaware of worker count (fully decoupled)
```
Copilot AI · Jan 24, 2026
The documentation states that "each worker checks capacity independently (no coordination)", but the actual capacity check has a TOCTOU race condition. Between checking capacity (lines 101-106) and spawning instances (line 188 in handleJob), multiple workers could exceed the limit. This discrepancy between the documented behavior and the actual implementation could mislead users about the system's reliability.
Consider either: (1) updating the documentation to mention this race condition and its implications, or (2) implementing proper coordination to ensure the documented behavior is accurate.
Suggested change:

```md
- Each worker checks capacity independently (no coordination); this is a best-effort check and is subject to a TOCTOU race with instance spawning
- In periods of high concurrency, multiple workers may all pass the capacity check and briefly exceed `MaxInstances` in aggregate; strict enforcement would require additional coordination
- Reaper runs on all workers, but only stops 1 instance per cycle
- Gateway is unaware of worker count (fully decoupled), so it does not enforce a global capacity limit across workers
```
```go
logger.Info("watching queue", "queue", protocol.QueueVMProvision)
// ...
maxInstances := 10
```
Copilot AI · Jan 24, 2026
Magic number: The maxInstances value is hardcoded to 10 (line 94), but this should be configurable or at least defined as a constant. The documentation (line 364) mentions "MaxInstances | 10 | Per-worker capacity limit", but having it hardcoded makes it difficult to adjust per deployment or environment.
Consider defining this as a configuration parameter or at minimum as a package-level constant with a clear name like DefaultMaxInstances or MaxInstancesPerWorker.
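One possible shape for that, sketched below; DefaultMaxInstances and the AETHER_MAX_INSTANCES variable are hypothetical names, not existing configuration:

```go
package worker

import (
    "os"
    "strconv"
)

// DefaultMaxInstances is the per-worker capacity limit used when no override
// is configured; 10 matches the value documented in DESIGN.md.
const DefaultMaxInstances = 10

// maxInstancesFromEnv reads an optional per-deployment override; the variable
// name is a placeholder for illustration.
func maxInstancesFromEnv() int {
    if v := os.Getenv("AETHER_MAX_INSTANCES"); v != "" {
        if n, err := strconv.Atoi(v); err == nil && n > 0 {
            return n
        }
    }
    return DefaultMaxInstances
}
```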
```go
    time.Sleep(500 * time.Millisecond)
    continue
}

freeRAM, err := system.GetFreeRAM()
if err == nil && freeRAM < 500 {
    logger.Info("low memory, waiting", "free_mb", freeRAM)
    time.Sleep(500 * time.Millisecond)
    continue
```
Copilot AI · Jan 24, 2026
Inefficient polling: When at capacity or low memory, the worker sleeps for 500ms before checking again (lines 110, 117). With a 1-second BLPop timeout (line 121), this creates a cycle time of ~1.5 seconds when capacity-constrained. This polling approach is less efficient than event-driven alternatives.
Consider either: (1) using a longer sleep duration when at capacity to reduce CPU overhead, (2) implementing an exponential backoff, or (3) using a channel-based notification system when capacity frees up. However, this is acceptable for the current design and is not critical.
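If the polling ever does become a problem, a capped exponential backoff is a small change; a sketch of the idea (the helper and its bounds are illustrative, not part of this PR):

```go
package worker

import "time"

const (
    minBackoff = 500 * time.Millisecond // matches the current fixed sleep
    maxBackoff = 5 * time.Second        // cap so recovery stays reasonably fast
)

// nextBackoff doubles the wait on each capacity- or memory-constrained
// iteration; callers reset to minBackoff after successfully consuming a job.
func nextBackoff(current time.Duration) time.Duration {
    if current < minBackoff {
        return minBackoff
    }
    next := current * 2
    if next > maxBackoff {
        return maxBackoff
    }
    return next
}
```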
This pull request introduces a major architectural change to the Function-as-a-Service (FaaS) platform, splitting auto-scaling responsibilities between the gateway (scale-up) and worker nodes (scale-down). The gateway now tracks active requests and pushes scale-up jobs, while workers independently manage scale-down and job consumption based on their own capacity. The documentation, core gateway, and worker logic have all been updated to reflect this new design. Additionally, a new script is added for updating the root filesystem, and system utility improvements are included.
Split Auto-Scaling Architecture
- Documentation (documentation/DESIGN.md): The scaling design is rewritten to describe the new split architecture, detailing how the gateway handles scale-up and workers handle scale-down, including new diagrams and configuration tables. [1] [2]

Gateway: Scale-Up Logic
- New scaler (gateway/internal/scaler.go): Implements a background scaler that tracks active requests per function, calculates the target number of instances, and pushes provision jobs to Redis when more instances are needed.
- Integration (gateway/internal/router.go, gateway/main.go): The router now tracks requests for scaling, and the scaler is started as a background process in the gateway's main entrypoint. [1] [2] [3] [4]

Worker: Scale-Down and Capacity Management
- Reaper (worker/internal/reaper.go): Adds a reaper that periodically checks for idle instances and implements a two-tier scale-down strategy (to 1 hot instance after 30s, to zero after 5m idle).
- Idle tracking (worker/internal/instance.go, worker/internal/proxy.go): Each instance now updates its last request time on every proxied request, enabling accurate idle detection for scale-down. [1] [2]

System Utilities and Scripts
- New script (scripts/update-rootfs.sh): Updates the root filesystem with the aether-env binary and installs it along with an init script.
- New utility (shared/system/system.go): GetFreeRAM() reads /proc/meminfo for worker capacity checks.

Minor Cleanups
- Comment and formatting cleanups (init/main.go, shared/protocol/messages.go).
This refactor lays the foundation for more reliable, scalable, and decoupled function instance management, with clear separation of concerns between control plane and worker nodes.
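For readers skimming the diff, here is a rough sketch of the gateway-side scale-up flow summarized above. The struct, method names, and the use of the go-redis client are assumptions for illustration; only the "track active requests and push provision jobs to Redis" behaviour comes from the PR description:

```go
package scaler

import (
    "context"
    "encoding/json"
    "sync"

    "github.com/redis/go-redis/v9"
)

// Scaler counts in-flight requests per function and pushes a provision job
// when demand exceeds the instances currently running.
type Scaler struct {
    mu     sync.Mutex
    active map[string]int // in-flight requests per function ID
    rdb    *redis.Client
    queue  string // e.g. the provision queue name from shared/protocol
}

func (s *Scaler) RequestStarted(functionID string)  { s.adjust(functionID, 1) }
func (s *Scaler) RequestFinished(functionID string) { s.adjust(functionID, -1) }

func (s *Scaler) adjust(functionID string, delta int) {
    s.mu.Lock()
    defer s.mu.Unlock()
    s.active[functionID] += delta
}

// maybeScaleUp pushes one provision job if active requests exceed the number
// of running instances reported for the function.
func (s *Scaler) maybeScaleUp(ctx context.Context, functionID string, running int) error {
    s.mu.Lock()
    need := s.active[functionID] > running
    s.mu.Unlock()
    if !need {
        return nil
    }
    job, err := json.Marshal(map[string]string{"function_id": functionID})
    if err != nil {
        return err
    }
    return s.rdb.LPush(ctx, s.queue, job).Err()
}
```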