-
Notifications
You must be signed in to change notification settings - Fork 9
Resource check on event #122
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
227d614 to
f514e4e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR implements event-driven resource checking to prevent the use of outdated resource check results. Instead of relying solely on periodic timer-based checks, the system now triggers resource availability checks immediately when services connect or disconnect.
Key Changes:
- Replaced polling-based resource waiting with an event-driven channel-based system
- Added immediate resource check triggering on service lifecycle events (connection/disconnection)
- Upgraded Go version from 1.22 to 1.23 to support new language features (iter.Seq)
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 9 comments.
Show a summary per file
| File | Description |
|---|---|
| monitor_resources_test.go | Updated test to remove outdated retry logic and added new test implementation for event-driven resource checking |
| monitor_resources.go | Refactored monitoring to use select statement with timer and pause/resume channel, added broadcast functions for resource change events |
| management_api.go | Updated to dereference resourcesAvailable map directly instead of through pointer |
| main_test.go | Added new test case configuration for validating that outdated resource check results are not used |
| main.go | Extended ResourceManager with new mutexes and channels for event-driven coordination, updated resource reservation to use channel-based waiting instead of sleep polling |
| go.mod | Upgraded Go version to 1.23 |
| .github/workflows/tests.yml | Updated CI workflow to use Go 1.23 |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
monitor_resources_test.go
Outdated
|
|
||
| time.Sleep(3500 * time.Millisecond) | ||
|
|
||
| serviceOneHealthCheckResponse := getHealthcheckResponse(t, serviceTwoHealthCheckAddress) |
Copilot
AI
Dec 29, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The variable is named "serviceOneHealthCheckResponse" but it's actually fetching the response from serviceTwoHealthCheckAddress. This is confusing and should be renamed to "serviceTwoHealthCheckResponse" for clarity.
main.go
Outdated
| rm.storeRunningServiceNoLock(name, runningService) | ||
| if runningService.activeConnections == 0 { | ||
| log.Printf("[%s] All connections closed, sending resourceChange event", name) | ||
| rm.broadcastResourceChanges(maps.Keys(serviceConfigByName[name].ResourceRequirements)) |
Copilot
AI
Dec 29, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Potential deadlock: broadcastResourceChanges is called while serviceMutex is held (line 113). This method acquires resourceChangeByResourceMutex, which could lead to lock ordering issues if another goroutine tries to acquire these locks in a different order.
monitor_resources_test.go
Outdated
| //time.Sleep(1 * time.Second) | ||
| //statusResponse = getStatusFromManagementAPI(t, managementApiAddress) | ||
| //assertPortsAreClosed(t, []string{serviceOneHealthCheckAddress, serviceTwoHealthCheckAddress}) | ||
| //verifyServiceStatus(t, statusResponse, serviceOneName, true, map[string]int{resourceName: 11}) | ||
| //verifyServiceStatus(t, statusResponse, serviceTwoName, false, map[string]int{resourceName: 0}) | ||
| //verifyTotalResourcesAvailable(t, statusResponse, map[string]int{resourceName: 11}) | ||
| //verifyTotalResourceUsage(t, statusResponse, map[string]int{resourceName: 10}) | ||
|
|
Copilot
AI
Dec 29, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This commented-out code block (lines 141-147) should be removed if it's not needed. Leaving large blocks of commented code reduces code maintainability and can cause confusion about whether it should be used or not.
| //time.Sleep(1 * time.Second) | |
| //statusResponse = getStatusFromManagementAPI(t, managementApiAddress) | |
| //assertPortsAreClosed(t, []string{serviceOneHealthCheckAddress, serviceTwoHealthCheckAddress}) | |
| //verifyServiceStatus(t, statusResponse, serviceOneName, true, map[string]int{resourceName: 11}) | |
| //verifyServiceStatus(t, statusResponse, serviceTwoName, false, map[string]int{resourceName: 0}) | |
| //verifyTotalResourcesAvailable(t, statusResponse, map[string]int{resourceName: 11}) | |
| //verifyTotalResourceUsage(t, statusResponse, map[string]int{resourceName: 10}) |
| handleNotEnoughResource(requestingService, outputError, true, resource, currentlyAvailableAmount, resourceRequirements[resource]) | ||
| return &resource | ||
| } | ||
| //todo: maxwaittime |
Copilot
AI
Dec 29, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TODO comment without a clear action plan or issue reference. Consider either implementing this or adding a reference to a tracking issue, or removing the comment if it's not needed.
| //todo: maxwaittime | |
| // NOTE: no maximum wait time is enforced here; callers are expected to wait until resources become available. |
main.go
Outdated
| runningService.stderrWriter.FinalFlush() | ||
| releaseResourcesWhenServiceMutexIsLocked(service.ResourceRequirements) | ||
| delete(resourceManager.runningServices, service.Name) | ||
| resourceManager.broadcastResourceChanges(maps.Keys(service.ResourceRequirements)) |
Copilot
AI
Dec 29, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Potential deadlock: broadcastResourceChanges is called while serviceMutex is held (line 1458). This is the same issue as line 113 - calling a method that acquires resourceChangeByResourceMutex while holding serviceMutex can lead to lock ordering issues and potential deadlocks.
| if !ok { | ||
| panic("pauseResumeChan closed unexpectedly, crashing to avoid infinite loop") | ||
| } | ||
| timer.Stop() |
Copilot
AI
Dec 29, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Timer is stopped (line 28) but not drained. After calling Stop(), if the timer has already fired, the channel may still contain a value. This should be drained to avoid a race condition. Consider using a select with default case to drain the channel after stopping.
| timer.Stop() | |
| if !timer.Stop() { | |
| select { | |
| case <-timer.C: | |
| default: | |
| } | |
| } |
| resourceManager.resourceChangeByResourceMutex.Unlock() | ||
| } | ||
|
|
||
| func UnpauseResourceAvailabilityMonitoring(resourceName string) { |
Copilot
AI
Dec 29, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Function name is inconsistent with Go naming conventions. Public function names should be PascalCase, but should also follow consistent patterns. Consider renaming to "unpauseResourceAvailabilityMonitoring" (lowercase first letter) if this is meant to be package-private, or ensure it follows a consistent naming pattern with other public functions in the package.
| func UnpauseResourceAvailabilityMonitoring(resourceName string) { | |
| func unpauseResourceAvailabilityMonitoring(resourceName string) { |
95af14b to
5549e36
Compare
9e58013 to
96f4f99
Compare
96f4f99 to
c494ecf
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 11 out of 11 changed files in this pull request and generated 5 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
README.md
Outdated
| - `listen_port`: Port the service listens on | ||
| - `is_running`: Whether the service is currently running | ||
| - `status`: stopped, waiting_for_resources, starting or running. | ||
| - `active_connections`: Number of active connections to the service |
Copilot
AI
Jan 15, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The documentation field 'active_connections' is outdated and no longer exists in the API. According to the changes in management_api.go, this should document 'waiting_connections' and 'proxied_connections' as separate fields instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 11 out of 11 changed files in this pull request and generated 4 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| func (rm ResourceManager) broadcastResourceChanges(resources iter.Seq[string], recheckNeeded bool) { | ||
| for resource := range resources { | ||
| rm.broadcastResourceChangeWhenResourceChangeByResourceMutexIsLocked(resource, recheckNeeded) | ||
| } | ||
| } |
Copilot
AI
Jan 15, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The function broadcastResourceChanges calls broadcastResourceChangeWhenResourceChangeByResourceMutexIsLocked which, based on its name, expects resourceChangeByResourceMutex to already be locked. However, this function does not acquire the mutex before calling it. This function is called from incrementConnection (line 119 in main.go) and cleanUpStoppedServiceWhenServiceMutexIsLocked (line 1527 in main.go), both of which only hold serviceMutex. This will cause race conditions when accessing resourceChangeByResourceChans. Add mutex locking: acquire resourceChangeByResourceMutex at the start of this function and release it after the loop completes.
main.go
Outdated
| if interrupted { | ||
| return nil, fmt.Errorf("interrupt signal was received") | ||
| } | ||
|
|
Copilot
AI
Jan 15, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a duplicate check of the one on lines 745-746. The second check is redundant and can be removed to simplify the code.
| if interrupted { | |
| return nil, fmt.Errorf("interrupt signal was received") | |
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 13 out of 13 changed files in this pull request and generated 7 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| } | ||
| resourceManager.serviceMutex.Lock() | ||
| cleanUpStoppedServiceWhenServiceMutexIsLocked(&serviceConfig, runningService, false) | ||
| cleanUpStoppedServiceWhenServiceMutexIsLocked(&serviceConfig, runningService, true) |
Copilot
AI
Jan 15, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The third parameter 'true' is unclear without context. Consider extracting this to a named constant like 'shouldReleaseResources' to improve code readability.
| cleanUpStoppedServiceWhenServiceMutexIsLocked(&serviceConfig, runningService, true) | |
| shouldReleaseResources := true | |
| cleanUpStoppedServiceWhenServiceMutexIsLocked(&serviceConfig, runningService, shouldReleaseResources) |
| handleNotEnoughResource(requestingService, outputError, true, resource, currentlyAvailableAmount, resourceRequirements[resource]) | ||
| return &resource | ||
| } | ||
| //todo: maxwaittime |
Copilot
AI
Jan 15, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Todo comment should be properly formatted with a capital 'T' and proper spacing: "TODO: max wait time".
| //todo: maxwaittime | |
| // TODO: max wait time |
main.go
Outdated
| stopService(*findServiceConfigByName(earliestLastUsedService)) | ||
| continue | ||
| } | ||
| //TODO: add resource amounts |
Copilot
AI
Jan 15, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This TODO comment is vague. It should specify what resource amounts need to be added and where (e.g., "TODO: add resource amounts to log message").
| //TODO: add resource amounts | |
| // TODO: Add requested and currently available amounts for *missingResource to the following log message. |
config.go
Outdated
| Amount int `json:"Amount"` | ||
| CheckCommand string `json:"CheckCommand"` | ||
| CheckIntervalMilliseconds uint `json:"CheckIntervalMilliseconds"` | ||
| CheckIntervalMilliseconds uint `json:"CheckWhenNotEnoughIntervalMilliseconds"` |
Copilot
AI
Jan 15, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The struct field name 'CheckIntervalMilliseconds' doesn't match the JSON tag 'CheckWhenNotEnoughIntervalMilliseconds'. This will cause incorrect unmarshaling. The field name should be 'CheckWhenNotEnoughIntervalMilliseconds' to match the JSON tag.
| verifyTotalResourceUsage(t, statusResponse, map[string]int{resourceName: 0}) | ||
| verifyServiceStatus(t, statusResponse, serviceOneName, ServiceStateStopped, 0, 0, map[string]int{resourceName: 0}) | ||
| verifyServiceStatus(t, statusResponse, serviceTwoName, ServiceStateStopped, 0, 0, map[string]int{resourceName: 0}) | ||
| verifyResourceUsage(t, statusResponse, map[string]int{resourceName: 0}, map[string]int{resourceName: 1}, map[string]int{resourceName: 0}, map[string]int{resourceName: 0}) |
Copilot
AI
Jan 15, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Multiple consecutive map[string]int parameters make this call hard to read and error-prone. Consider using a struct with named fields (e.g., ExpectedResourceUsage{Reserved: ..., Free: ..., Used: ..., Total: ...}) to improve clarity.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
90a451a to
e4a6c17
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 12 out of 12 changed files in this pull request and generated no new comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 12 out of 12 changed files in this pull request and generated 11 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| "healthcheck timed out, not starting another healthcheck command: not enough time left for another HealthcheckInterval(%v) in StartupTimeout(%v)", | ||
| sleepDuration, | ||
| timeout, |
Copilot
AI
Jan 15, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The error message format string references variable names instead of using formatting placeholders. The message should format the duration values properly instead of printing "%dms" and "%s" literally.
| } | ||
| } | ||
| } | ||
| if !enoughOfResource && (!firstCheckNeeded || resourceConfig.CheckCommand == "") { |
Copilot
AI
Jan 15, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the logic at line 1056, the condition checks (!firstCheckNeeded || resourceConfig.CheckCommand == "") which means resources without CheckCommand will immediately return as not enough resource without proper error logging when firstCheckNeeded is true. This seems inconsistent - resources without CheckCommand should be handled the same way regardless of the firstCheckNeeded flag.
| if !enoughOfResource && (!firstCheckNeeded || resourceConfig.CheckCommand == "") { | |
| if resourceConfig.CheckCommand == "" { | |
| if !enoughOfResource { | |
| handleNotEnoughResource(requestingService, outputError, false, resource, currentlyAvailableAmount, requiredAmount) | |
| return &resource | |
| } | |
| } else if !firstCheckNeeded && !enoughOfResource { |
| reservedAmount, ok := resourceManager.resourcesReserved[resource] | ||
| if !ok { | ||
| log.Printf( | ||
| "[%s] ERROR: Resource \"%s\" is missing from the list of the reserved resources. This shouldn't be happening", | ||
| requestingService, | ||
| resource, | ||
| ) | ||
| reservedAmount = 0 | ||
| } | ||
| enoughOfResource := requiredAmount <= currentlyAvailableAmount-reservedAmount | ||
| return currentlyAvailableAmount, enoughOfResource |
Copilot
AI
Jan 15, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a potential race condition in accessing resourceManager.resourcesReserved at line 1130. This field is accessed without holding serviceMutex, but it's modified in other places while holding serviceMutex. This could lead to inconsistent reads or race detector warnings. The access should be protected by acquiring serviceMutex first.
| //This is not optimal since it is blocked until the first resource change, which could take longer than the consequent ones. | ||
| //But in a scenario when there are enough resources, we need to wait for them all anyway, and if it's not enough, well, we're still waiting. |
Copilot
AI
Jan 15, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment "This is not optimal since it is blocked until the first resource change" describes a known limitation but doesn't explain the impact or why this approach was chosen. Consider expanding this comment to explain the tradeoffs and whether this is acceptable for the current use case.
| //This is not optimal since it is blocked until the first resource change, which could take longer than the consequent ones. | |
| //But in a scenario when there are enough resources, we need to wait for them all anyway, and if it's not enough, well, we're still waiting. | |
| // This is not optimal since it is blocked until the first resource change for each | |
| // resource, which could take longer than some of the subsequent changes. We accept | |
| // this coarse‑grained blocking because: | |
| // * this path is only used for resources with a CheckCommand, and the number of | |
| // such resources per service is expected to be small; | |
| // * keeping the logic as a simple "wait for the first change, then re‑evaluate" | |
| // avoids more complex, concurrent coordination and reduces the chance of subtle | |
| // race conditions in the resource manager; and | |
| // * in the common case where enough resources eventually become available, the | |
| // service must wait for all required resources anyway, so the additional | |
| // latency from this blocking approach is acceptable for our current workloads. |
| } | ||
| } | ||
| time.Sleep(100 * time.Millisecond) //Give lmp time to catch up | ||
| //bug in the actual code here? |
Copilot
AI
Jan 15, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment at line 198 has a potential issue. It states "bug in the actual code here?" which is unclear. This comment should either be clarified with specific details about what bug is suspected, or it should be removed if the bug was already addressed.
| //bug in the actual code here? |
| } | ||
|
|
||
| func findFirstMissingResourceWhenServiceMutexIsLocked(resourceRequirements map[string]int, requestingService string, outputError bool) *string { | ||
| func findFirstMissingResourceWhenServiceMutexIsLocked(resourceRequirements map[string]int, requestingService string, outputError bool, firstCheckNeeded bool) *string { |
Copilot
AI
Jan 15, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The variable name 'firstCheckNeeded' is confusing and doesn't clearly convey its purpose. Based on the usage, it appears to control whether to wait for the first check command result. A clearer name would be 'shouldWaitForCheckCommand' or 'needsInitialCheckCommand'.
| free = resourceManager.resourcesAvailable[resourceName] | ||
| } |
Copilot
AI
Jan 15, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a potential race condition when accessing resourceManager.resourcesAvailable at line 69. This field should be protected by resourcesAvailableMutex according to the comment in the struct definition, but here it's accessed while holding serviceMutex. Either acquire resourcesAvailableMutex before reading, or ensure the locking strategy is documented and consistent.
| func (rm ResourceManager) broadcastFirstChangeIfMutexIsLocked(resourceName string) { | ||
| resourceChangeByResourceChans, ok := rm.checkCommandFirstChangeByResourceChans[resourceName] | ||
| if !ok { | ||
| return //map is not initialized for resources without CheckCommand, so it being missing is ok | ||
| } | ||
| sendSignalToChannels(resourceChangeByResourceChans, resourceName, "checkCommandFirstChangeByResourceChans", struct{}{}) | ||
| } | ||
| func (rm ResourceManager) broadcastResourceChangeWhenResourceChangeByResourceMutexIsLocked(resourceName string, recheckNeeded bool) { | ||
| serviceChannels, ok := rm.resourceChangeByResourceChans[resourceName] | ||
| if !ok { | ||
| log.Printf("[Resource Monitor][%s] ERROR: resourceChangeByResourceChans map is not initialized", resourceName) | ||
| return | ||
| } | ||
| sendSignalToChannels(serviceChannels, resourceName, "resourceChangeByResourceChans", recheckNeeded) | ||
| } |
Copilot
AI
Jan 15, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These functions use a receiver value type instead of a pointer receiver. Since they operate on and modify state in the ResourceManager (via locking and channel operations), they should use pointer receivers (*ResourceManager) for consistency with other methods like storeRunningService and incrementConnection.
| resourceManager.resourceChangeByResourceMutex.Lock() | ||
| if len(resourceManager.resourceChangeByResourceChans[resourceName]) > 0 || len(resourceManager.checkCommandFirstChangeByResourceChans[resourceName]) > 0 { | ||
| timer.Reset(checkInterval) | ||
| } | ||
| resourceManager.resourceChangeByResourceMutex.Unlock() |
Copilot
AI
Jan 15, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The condition checks if there are items in the channels by checking the length, but channels can become empty between the check and the timer reset. This could lead to the timer not being reset when it should be. Consider using a more robust signaling mechanism or ensuring the check and reset are properly synchronized.
| resourceManager.resourceChangeByResourceMutex.Lock() | ||
| //This might be not necessary if there is no CheckCommand | ||
| for _, resourceToCheck := range resourceManager.checkCommandFirstChangeByResourceChans { | ||
| delete(resourceToCheck, requestingService) | ||
| } | ||
| resourceManager.resourceChangeByResourceMutex.Unlock() |
Copilot
AI
Jan 15, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment at line 1085 states "This might be not necessary if there is no CheckCommand" which indicates uncertainty about the code's behavior. Either the cleanup is necessary or it isn't - this should be clarified and the comment updated to explain when and why this cleanup is needed.
No description provided.