Skip to content

Conversation

@perk11
Copy link
Owner

@perk11 perk11 commented Dec 28, 2025

No description provided.

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR implements event-driven resource checking to prevent the use of outdated resource check results. Instead of relying solely on periodic timer-based checks, the system now triggers resource availability checks immediately when services connect or disconnect.

Key Changes:

  • Replaced polling-based resource waiting with an event-driven channel-based system
  • Added immediate resource check triggering on service lifecycle events (connection/disconnection)
  • Upgraded Go version from 1.22 to 1.23 to support new language features (iter.Seq)

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 9 comments.

Show a summary per file
File Description
monitor_resources_test.go Updated test to remove outdated retry logic and added new test implementation for event-driven resource checking
monitor_resources.go Refactored monitoring to use select statement with timer and pause/resume channel, added broadcast functions for resource change events
management_api.go Updated to dereference resourcesAvailable map directly instead of through pointer
main_test.go Added new test case configuration for validating that outdated resource check results are not used
main.go Extended ResourceManager with new mutexes and channels for event-driven coordination, updated resource reservation to use channel-based waiting instead of sleep polling
go.mod Upgraded Go version to 1.23
.github/workflows/tests.yml Updated CI workflow to use Go 1.23

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.


time.Sleep(3500 * time.Millisecond)

serviceOneHealthCheckResponse := getHealthcheckResponse(t, serviceTwoHealthCheckAddress)
Copy link

Copilot AI Dec 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The variable is named "serviceOneHealthCheckResponse" but it's actually fetching the response from serviceTwoHealthCheckAddress. This is confusing and should be renamed to "serviceTwoHealthCheckResponse" for clarity.

Copilot uses AI. Check for mistakes.
main.go Outdated
rm.storeRunningServiceNoLock(name, runningService)
if runningService.activeConnections == 0 {
log.Printf("[%s] All connections closed, sending resourceChange event", name)
rm.broadcastResourceChanges(maps.Keys(serviceConfigByName[name].ResourceRequirements))
Copy link

Copilot AI Dec 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Potential deadlock: broadcastResourceChanges is called while serviceMutex is held (line 113). This method acquires resourceChangeByResourceMutex, which could lead to lock ordering issues if another goroutine tries to acquire these locks in a different order.

Copilot uses AI. Check for mistakes.
Comment on lines 141 to 148
//time.Sleep(1 * time.Second)
//statusResponse = getStatusFromManagementAPI(t, managementApiAddress)
//assertPortsAreClosed(t, []string{serviceOneHealthCheckAddress, serviceTwoHealthCheckAddress})
//verifyServiceStatus(t, statusResponse, serviceOneName, true, map[string]int{resourceName: 11})
//verifyServiceStatus(t, statusResponse, serviceTwoName, false, map[string]int{resourceName: 0})
//verifyTotalResourcesAvailable(t, statusResponse, map[string]int{resourceName: 11})
//verifyTotalResourceUsage(t, statusResponse, map[string]int{resourceName: 10})

Copy link

Copilot AI Dec 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This commented-out code block (lines 141-147) should be removed if it's not needed. Leaving large blocks of commented code reduces code maintainability and can cause confusion about whether it should be used or not.

Suggested change
//time.Sleep(1 * time.Second)
//statusResponse = getStatusFromManagementAPI(t, managementApiAddress)
//assertPortsAreClosed(t, []string{serviceOneHealthCheckAddress, serviceTwoHealthCheckAddress})
//verifyServiceStatus(t, statusResponse, serviceOneName, true, map[string]int{resourceName: 11})
//verifyServiceStatus(t, statusResponse, serviceTwoName, false, map[string]int{resourceName: 0})
//verifyTotalResourcesAvailable(t, statusResponse, map[string]int{resourceName: 11})
//verifyTotalResourceUsage(t, statusResponse, map[string]int{resourceName: 10})

Copilot uses AI. Check for mistakes.
handleNotEnoughResource(requestingService, outputError, true, resource, currentlyAvailableAmount, resourceRequirements[resource])
return &resource
}
//todo: maxwaittime
Copy link

Copilot AI Dec 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO comment without a clear action plan or issue reference. Consider either implementing this or adding a reference to a tracking issue, or removing the comment if it's not needed.

Suggested change
//todo: maxwaittime
// NOTE: no maximum wait time is enforced here; callers are expected to wait until resources become available.

Copilot uses AI. Check for mistakes.
main.go Outdated
runningService.stderrWriter.FinalFlush()
releaseResourcesWhenServiceMutexIsLocked(service.ResourceRequirements)
delete(resourceManager.runningServices, service.Name)
resourceManager.broadcastResourceChanges(maps.Keys(service.ResourceRequirements))
Copy link

Copilot AI Dec 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Potential deadlock: broadcastResourceChanges is called while serviceMutex is held (line 1458). This is the same issue as line 113 - calling a method that acquires resourceChangeByResourceMutex while holding serviceMutex can lead to lock ordering issues and potential deadlocks.

Copilot uses AI. Check for mistakes.
if !ok {
panic("pauseResumeChan closed unexpectedly, crashing to avoid infinite loop")
}
timer.Stop()
Copy link

Copilot AI Dec 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Timer is stopped (line 28) but not drained. After calling Stop(), if the timer has already fired, the channel may still contain a value. This should be drained to avoid a race condition. Consider using a select with default case to drain the channel after stopping.

Suggested change
timer.Stop()
if !timer.Stop() {
select {
case <-timer.C:
default:
}
}

Copilot uses AI. Check for mistakes.
resourceManager.resourceChangeByResourceMutex.Unlock()
}

func UnpauseResourceAvailabilityMonitoring(resourceName string) {
Copy link

Copilot AI Dec 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Function name is inconsistent with Go naming conventions. Public function names should be PascalCase, but should also follow consistent patterns. Consider renaming to "unpauseResourceAvailabilityMonitoring" (lowercase first letter) if this is meant to be package-private, or ensure it follows a consistent naming pattern with other public functions in the package.

Suggested change
func UnpauseResourceAvailabilityMonitoring(resourceName string) {
func unpauseResourceAvailabilityMonitoring(resourceName string) {

Copilot uses AI. Check for mistakes.
@perk11 perk11 changed the title Resourse check on event Resource check on event Dec 29, 2025
@perk11 perk11 force-pushed the resourse-check-on-event branch from 95af14b to 5549e36 Compare January 4, 2026 07:02
@perk11 perk11 force-pushed the resourse-check-on-event branch from 9e58013 to 96f4f99 Compare January 15, 2026 02:32
@perk11 perk11 requested a review from Copilot January 15, 2026 02:32
@perk11 perk11 force-pushed the resourse-check-on-event branch from 96f4f99 to c494ecf Compare January 15, 2026 02:33
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 5 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

README.md Outdated
- `listen_port`: Port the service listens on
- `is_running`: Whether the service is currently running
- `status`: stopped, waiting_for_resources, starting or running.
- `active_connections`: Number of active connections to the service
Copy link

Copilot AI Jan 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The documentation field 'active_connections' is outdated and no longer exists in the API. According to the changes in management_api.go, this should document 'waiting_connections' and 'proxied_connections' as separate fields instead.

Copilot uses AI. Check for mistakes.
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 4 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 100 to 106
func (rm ResourceManager) broadcastResourceChanges(resources iter.Seq[string], recheckNeeded bool) {
for resource := range resources {
rm.broadcastResourceChangeWhenResourceChangeByResourceMutexIsLocked(resource, recheckNeeded)
}
}
Copy link

Copilot AI Jan 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The function broadcastResourceChanges calls broadcastResourceChangeWhenResourceChangeByResourceMutexIsLocked which, based on its name, expects resourceChangeByResourceMutex to already be locked. However, this function does not acquire the mutex before calling it. This function is called from incrementConnection (line 119 in main.go) and cleanUpStoppedServiceWhenServiceMutexIsLocked (line 1527 in main.go), both of which only hold serviceMutex. This will cause race conditions when accessing resourceChangeByResourceChans. Add mutex locking: acquire resourceChangeByResourceMutex at the start of this function and release it after the loop completes.

Copilot uses AI. Check for mistakes.
main.go Outdated
Comment on lines 748 to 751
if interrupted {
return nil, fmt.Errorf("interrupt signal was received")
}

Copy link

Copilot AI Jan 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a duplicate check of the one on lines 745-746. The second check is redundant and can be removed to simplify the code.

Suggested change
if interrupted {
return nil, fmt.Errorf("interrupt signal was received")
}

Copilot uses AI. Check for mistakes.
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 7 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

}
resourceManager.serviceMutex.Lock()
cleanUpStoppedServiceWhenServiceMutexIsLocked(&serviceConfig, runningService, false)
cleanUpStoppedServiceWhenServiceMutexIsLocked(&serviceConfig, runningService, true)
Copy link

Copilot AI Jan 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The third parameter 'true' is unclear without context. Consider extracting this to a named constant like 'shouldReleaseResources' to improve code readability.

Suggested change
cleanUpStoppedServiceWhenServiceMutexIsLocked(&serviceConfig, runningService, true)
shouldReleaseResources := true
cleanUpStoppedServiceWhenServiceMutexIsLocked(&serviceConfig, runningService, shouldReleaseResources)

Copilot uses AI. Check for mistakes.
handleNotEnoughResource(requestingService, outputError, true, resource, currentlyAvailableAmount, resourceRequirements[resource])
return &resource
}
//todo: maxwaittime
Copy link

Copilot AI Jan 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Todo comment should be properly formatted with a capital 'T' and proper spacing: "TODO: max wait time".

Suggested change
//todo: maxwaittime
// TODO: max wait time

Copilot uses AI. Check for mistakes.
main.go Outdated
stopService(*findServiceConfigByName(earliestLastUsedService))
continue
}
//TODO: add resource amounts
Copy link

Copilot AI Jan 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This TODO comment is vague. It should specify what resource amounts need to be added and where (e.g., "TODO: add resource amounts to log message").

Suggested change
//TODO: add resource amounts
// TODO: Add requested and currently available amounts for *missingResource to the following log message.

Copilot uses AI. Check for mistakes.
config.go Outdated
Amount int `json:"Amount"`
CheckCommand string `json:"CheckCommand"`
CheckIntervalMilliseconds uint `json:"CheckIntervalMilliseconds"`
CheckIntervalMilliseconds uint `json:"CheckWhenNotEnoughIntervalMilliseconds"`
Copy link

Copilot AI Jan 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The struct field name 'CheckIntervalMilliseconds' doesn't match the JSON tag 'CheckWhenNotEnoughIntervalMilliseconds'. This will cause incorrect unmarshaling. The field name should be 'CheckWhenNotEnoughIntervalMilliseconds' to match the JSON tag.

Copilot uses AI. Check for mistakes.
verifyTotalResourceUsage(t, statusResponse, map[string]int{resourceName: 0})
verifyServiceStatus(t, statusResponse, serviceOneName, ServiceStateStopped, 0, 0, map[string]int{resourceName: 0})
verifyServiceStatus(t, statusResponse, serviceTwoName, ServiceStateStopped, 0, 0, map[string]int{resourceName: 0})
verifyResourceUsage(t, statusResponse, map[string]int{resourceName: 0}, map[string]int{resourceName: 1}, map[string]int{resourceName: 0}, map[string]int{resourceName: 0})
Copy link

Copilot AI Jan 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Multiple consecutive map[string]int parameters make this call hard to read and error-prone. Consider using a struct with named fields (e.g., ExpectedResourceUsage{Reserved: ..., Free: ..., Used: ..., Total: ...}) to improve clarity.

Copilot uses AI. Check for mistakes.
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 12 out of 12 changed files in this pull request and generated no new comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 12 out of 12 changed files in this pull request and generated 11 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +835 to 837
"healthcheck timed out, not starting another healthcheck command: not enough time left for another HealthcheckInterval(%v) in StartupTimeout(%v)",
sleepDuration,
timeout,
Copy link

Copilot AI Jan 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error message format string references variable names instead of using formatting placeholders. The message should format the duration values properly instead of printing "%dms" and "%s" literally.

Copilot uses AI. Check for mistakes.
}
}
}
if !enoughOfResource && (!firstCheckNeeded || resourceConfig.CheckCommand == "") {
Copy link

Copilot AI Jan 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the logic at line 1056, the condition checks (!firstCheckNeeded || resourceConfig.CheckCommand == "") which means resources without CheckCommand will immediately return as not enough resource without proper error logging when firstCheckNeeded is true. This seems inconsistent - resources without CheckCommand should be handled the same way regardless of the firstCheckNeeded flag.

Suggested change
if !enoughOfResource && (!firstCheckNeeded || resourceConfig.CheckCommand == "") {
if resourceConfig.CheckCommand == "" {
if !enoughOfResource {
handleNotEnoughResource(requestingService, outputError, false, resource, currentlyAvailableAmount, requiredAmount)
return &resource
}
} else if !firstCheckNeeded && !enoughOfResource {

Copilot uses AI. Check for mistakes.
Comment on lines +1130 to +1140
reservedAmount, ok := resourceManager.resourcesReserved[resource]
if !ok {
log.Printf(
"[%s] ERROR: Resource \"%s\" is missing from the list of the reserved resources. This shouldn't be happening",
requestingService,
resource,
)
reservedAmount = 0
}
enoughOfResource := requiredAmount <= currentlyAvailableAmount-reservedAmount
return currentlyAvailableAmount, enoughOfResource
Copy link

Copilot AI Jan 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a potential race condition in accessing resourceManager.resourcesReserved at line 1130. This field is accessed without holding serviceMutex, but it's modified in other places while holding serviceMutex. This could lead to inconsistent reads or race detector warnings. The access should be protected by acquiring serviceMutex first.

Copilot uses AI. Check for mistakes.
Comment on lines +1061 to +1062
//This is not optimal since it is blocked until the first resource change, which could take longer than the consequent ones.
//But in a scenario when there are enough resources, we need to wait for them all anyway, and if it's not enough, well, we're still waiting.
Copy link

Copilot AI Jan 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment "This is not optimal since it is blocked until the first resource change" describes a known limitation but doesn't explain the impact or why this approach was chosen. Consider expanding this comment to explain the tradeoffs and whether this is acceptable for the current use case.

Suggested change
//This is not optimal since it is blocked until the first resource change, which could take longer than the consequent ones.
//But in a scenario when there are enough resources, we need to wait for them all anyway, and if it's not enough, well, we're still waiting.
// This is not optimal since it is blocked until the first resource change for each
// resource, which could take longer than some of the subsequent changes. We accept
// this coarse‑grained blocking because:
// * this path is only used for resources with a CheckCommand, and the number of
// such resources per service is expected to be small;
// * keeping the logic as a simple "wait for the first change, then re‑evaluate"
// avoids more complex, concurrent coordination and reduces the chance of subtle
// race conditions in the resource manager; and
// * in the common case where enough resources eventually become available, the
// service must wait for all required resources anyway, so the additional
// latency from this blocking approach is acceptable for our current workloads.

Copilot uses AI. Check for mistakes.
}
}
time.Sleep(100 * time.Millisecond) //Give lmp time to catch up
//bug in the actual code here?
Copy link

Copilot AI Jan 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment at line 198 has a potential issue. It states "bug in the actual code here?" which is unclear. This comment should either be clarified with specific details about what bug is suspected, or it should be removed if the bug was already addressed.

Suggested change
//bug in the actual code here?

Copilot uses AI. Check for mistakes.
}

func findFirstMissingResourceWhenServiceMutexIsLocked(resourceRequirements map[string]int, requestingService string, outputError bool) *string {
func findFirstMissingResourceWhenServiceMutexIsLocked(resourceRequirements map[string]int, requestingService string, outputError bool, firstCheckNeeded bool) *string {
Copy link

Copilot AI Jan 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The variable name 'firstCheckNeeded' is confusing and doesn't clearly convey its purpose. Based on the usage, it appears to control whether to wait for the first check command result. A clearer name would be 'shouldWaitForCheckCommand' or 'needsInitialCheckCommand'.

Copilot uses AI. Check for mistakes.
Comment on lines +69 to +70
free = resourceManager.resourcesAvailable[resourceName]
}
Copy link

Copilot AI Jan 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a potential race condition when accessing resourceManager.resourcesAvailable at line 69. This field should be protected by resourcesAvailableMutex according to the comment in the struct definition, but here it's accessed while holding serviceMutex. Either acquire resourcesAvailableMutex before reading, or ensure the locking strategy is documented and consistent.

Copilot uses AI. Check for mistakes.
Comment on lines +107 to +121
func (rm ResourceManager) broadcastFirstChangeIfMutexIsLocked(resourceName string) {
resourceChangeByResourceChans, ok := rm.checkCommandFirstChangeByResourceChans[resourceName]
if !ok {
return //map is not initialized for resources without CheckCommand, so it being missing is ok
}
sendSignalToChannels(resourceChangeByResourceChans, resourceName, "checkCommandFirstChangeByResourceChans", struct{}{})
}
func (rm ResourceManager) broadcastResourceChangeWhenResourceChangeByResourceMutexIsLocked(resourceName string, recheckNeeded bool) {
serviceChannels, ok := rm.resourceChangeByResourceChans[resourceName]
if !ok {
log.Printf("[Resource Monitor][%s] ERROR: resourceChangeByResourceChans map is not initialized", resourceName)
return
}
sendSignalToChannels(serviceChannels, resourceName, "resourceChangeByResourceChans", recheckNeeded)
}
Copy link

Copilot AI Jan 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These functions use a receiver value type instead of a pointer receiver. Since they operate on and modify state in the ResourceManager (via locking and channel operations), they should use pointer receivers (*ResourceManager) for consistency with other methods like storeRunningService and incrementConnection.

Copilot uses AI. Check for mistakes.
Comment on lines +34 to +38
resourceManager.resourceChangeByResourceMutex.Lock()
if len(resourceManager.resourceChangeByResourceChans[resourceName]) > 0 || len(resourceManager.checkCommandFirstChangeByResourceChans[resourceName]) > 0 {
timer.Reset(checkInterval)
}
resourceManager.resourceChangeByResourceMutex.Unlock()
Copy link

Copilot AI Jan 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The condition checks if there are items in the channels by checking the length, but channels can become empty between the check and the timer reset. This could lead to the timer not being reset when it should be. Consider using a more robust signaling mechanism or ensuring the check and reset are properly synchronized.

Copilot uses AI. Check for mistakes.
Comment on lines +1084 to +1089
resourceManager.resourceChangeByResourceMutex.Lock()
//This might be not necessary if there is no CheckCommand
for _, resourceToCheck := range resourceManager.checkCommandFirstChangeByResourceChans {
delete(resourceToCheck, requestingService)
}
resourceManager.resourceChangeByResourceMutex.Unlock()
Copy link

Copilot AI Jan 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment at line 1085 states "This might be not necessary if there is no CheckCommand" which indicates uncertainty about the code's behavior. Either the cleanup is necessary or it isn't - this should be clarified and the comment updated to explain when and why this cleanup is needed.

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants