Copilot AI commented Dec 20, 2025

Proposed changes

AgentRun RPC calls fail with "code = Unavailable desc = transport is closing" when agents are temporarily unavailable or busy, especially under high concurrent load. This results in missed job executions.

Implementation

Configuration (3 new settings, all optional):

  • agent-run-max-retries (default: 3) - Maximum retry attempts
  • agent-run-retry-initial-interval (default: 1s) - Initial backoff
  • agent-run-retry-max-interval (default: 30s) - Backoff cap

Retry Logic:

  • Exponential backoff: 1s → 2s → 4s → 8s → ... (capped at max interval)
  • Retries only on transient gRPC codes: Unavailable, DeadlineExceeded, ResourceExhausted, Aborted, Internal
  • Falls back to string matching for non-gRPC network errors (connection refused, broken pipe, etc.)
  • Non-retryable errors (InvalidArgument, NotFound) fail immediately

Usage:

dkron agent --agent-run-max-retries=5 \
            --agent-run-retry-initial-interval=2s \
            --agent-run-retry-max-interval=60s

Or via config file:

agent-run-max-retries: 5
agent-run-retry-initial-interval: 2s
agent-run-retry-max-interval: 60s

Testing

  • 14 unit test cases covering gRPC status codes and string-based error detection
  • Safe backoff calculation prevents integer overflow
  • CodeQL scan clean

Types of changes

  • Bugfix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation Update (if none of the other choices apply)

Warning

Firewall rules blocked me from connecting to one or more addresses (expand for details)

I tried to connect to the following addresses, but was blocked by firewall rules:

  • 127.0.0.10
    • Triggering command: /tmp/go-build3328612238/b001/dkron.test /tmp/go-build3328612238/b001/dkron.test -test.testlogfile=/tmp/go-build3328612238/b001/testlog.txt -test.paniconexit0 -test.timeout=10m0s -test.v=true -test.run=TestGRPCExecutionDone -trimpath ux_amd64/vet -p d/autoscaling/v2-atomic -lang=go1.23 ux_amd64/vet -W kg_.a om/golang/protob-ifaceassert ux_amd64/vet . ateway/v2/utilit-atomic --64 ux_amd64/vet (packet block)
    • Triggering command: /tmp/go-build264584732/b001/dkron.test /tmp/go-build264584732/b001/dkron.test -test.testlogfile=/tmp/go-build264584732/b001/testlog.txt -test.paniconexit0 -test.run=TestAgent|TestJob|TestScheduler -test.timeout=5m0s n4Jnfa_9W .cfg ux_amd64/vet s.go nt.go -lang=go1.24 ux_amd64/vet -o elemetry.io/cont-errorsas .cfg ux_amd64/vet -p ery/pkg/util/man--norc -lang=go1.21 ux_amd64/vet (packet block)
    • Triggering command: /tmp/go-build3508022003/b001/dkron.test /tmp/go-build3508022003/b001/dkron.test -test.testlogfile=/tmp/go-build3508022003/b001/testlog.txt -test.paniconexit0 -test.timeout=10m0s --ctstate INVALID,NEW -j DROP -p github.com/moderls-files -lang=go1.12 ux_amd64/vet (packet block)
  • 127.0.0.11
    • Triggering command: /tmp/go-build264584732/b001/dkron.test /tmp/go-build264584732/b001/dkron.test -test.testlogfile=/tmp/go-build264584732/b001/testlog.txt -test.paniconexit0 -test.run=TestAgent|TestJob|TestScheduler -test.timeout=5m0s n4Jnfa_9W .cfg ux_amd64/vet s.go nt.go -lang=go1.24 ux_amd64/vet -o elemetry.io/cont-errorsas .cfg ux_amd64/vet -p ery/pkg/util/man--norc -lang=go1.21 ux_amd64/vet (packet block)
    • Triggering command: /tmp/go-build3508022003/b001/dkron.test /tmp/go-build3508022003/b001/dkron.test -test.testlogfile=/tmp/go-build3508022003/b001/testlog.txt -test.paniconexit0 -test.timeout=10m0s --ctstate INVALID,NEW -j DROP -p github.com/moderls-files -lang=go1.12 ux_amd64/vet (packet block)
    • Triggering command: /tmp/go-build2409181327/b001/dkron.test /tmp/go-build2409181327/b001/dkron.test -test.testlogfile=/tmp/go-build2409181327/b001/testlog.txt -test.paniconexit0 -test.run=Test_getTargetNodes -test.timeout=2m0s (packet block)
  • 127.0.0.13
    • Triggering command: /tmp/go-build3508022003/b001/dkron.test /tmp/go-build3508022003/b001/dkron.test -test.testlogfile=/tmp/go-build3508022003/b001/testlog.txt -test.paniconexit0 -test.timeout=10m0s --ctstate INVALID,NEW -j DROP -p github.com/moderls-files -lang=go1.12 ux_amd64/vet (packet block)
  • 127.0.0.14
    • Triggering command: /tmp/go-build3508022003/b001/dkron.test /tmp/go-build3508022003/b001/dkron.test -test.testlogfile=/tmp/go-build3508022003/b001/testlog.txt -test.paniconexit0 -test.timeout=10m0s --ctstate INVALID,NEW -j DROP -p github.com/moderls-files -lang=go1.12 ux_amd64/vet (packet block)

Original prompt

This section details the original issue you should resolve.

<issue_title>AgentRun: rpc error: code = Unavailable desc = transport is closing</issue_title>
<issue_description>Describe the bug
AgentRun RPC calls sometimes fail on a selected node.

This is where this error comes from: https://github.com/distribworks/dkron/blob/master/dkron/grpc_agent.go#L92

because that one fails:
https://github.com/distribworks/dkron/blob/master/dkron/grpc_agent.go#L78

This makes me think that the RPC call to the agent node simply fails here:
https://github.com/distribworks/dkron/blob/master/plugin/executor.go#L78

Around the same moment, a few other job execution attempts on the same node also fail with the same error.
I guess some retries could be implemented there. The failure may happen because the agent is busy with something at that moment, causing a timeout to fire on the caller side.

The error does not happen often, but it results in missed executions.
It most likely happens when many job executions are sent to the same agent at the same time (which is our case).

</issue_description>

<agent_instructions>Analyze the issue and propose a solution</agent_instructions>

Comments on the Issue (you are @copilot in this section)



coderabbitai bot commented Dec 20, 2025

Important

Review skipped

Bot user detected.

To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Comment @coderabbitai help to get the list of available commands and usage tips.

Copilot AI and others added 4 commits December 20, 2025 10:50
Co-authored-by: vcastellm <47026+vcastellm@users.noreply.github.com>
Co-authored-by: vcastellm <47026+vcastellm@users.noreply.github.com>
Co-authored-by: vcastellm <47026+vcastellm@users.noreply.github.com>
…ulation

Co-authored-by: vcastellm <47026+vcastellm@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Fix rpc error handling in AgentRun calls" to "Add retry logic with exponential backoff for AgentRun RPC calls" Dec 20, 2025
Copilot AI requested a review from vcastellm December 20, 2025 11:06
Development

Successfully merging this pull request may close these issues.

AgentRun: rpc error: code = Unavailable desc = transport is closing
