Copilot AI commented Sep 29, 2025

Problem

When the sandcat agent runs on devices with Zscaler ZIA or similar network interference tools, occasional network errors cause operations to freeze indefinitely without retry or failure reporting. This manifests as:

  • Failed to decode HTTP response: illegal base64 data at input byte 0
  • Failed to perform HTTP request: Post "https://xxxxx/beacon": read tcp 10.XX.XXX.XXX:58794->XXX.XXX.XX.X:443: wsarecv: A connection attempt failed...
  • Agent continues beaconing as "ALIVE", but operations never complete or time out in the Caldera UI

Solution

This PR implements comprehensive HTTP retry logic with exponential backoff to handle temporary network failures gracefully.

Key Changes

1. Retry Configuration

  • Maximum 3 retries per HTTP request (4 attempts in total, matching the "attempt 1/4" counter in the logs)
  • Exponential backoff: 2s → 4s → 8s with jitter to prevent thundering herd
  • 30-second HTTP timeout per request to prevent indefinite hangs
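The backoff schedule above can be sketched as follows. This is a minimal illustration, not the code from the PR: `retryDelay`, `baseRetryDelay`, and `maxJitter` are hypothetical names, and the 8-second value for `maxRetryDelay` is an assumption inferred from the 2s → 4s → 8s schedule.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// Hypothetical constants mirroring the values listed in the PR description.
// Only the name maxRetryDelay appears in the PR snippet; its value here is
// an assumption.
const (
	maxRetries     = 3
	baseRetryDelay = 2 * time.Second
	maxRetryDelay  = 8 * time.Second
	maxJitter      = time.Second
)

// retryDelay returns the wait before retry number `retry` (1-based):
// baseRetryDelay * 2^(retry-1), capped at maxRetryDelay, plus a random
// jitter in [0, maxJitter).
func retryDelay(retry int) time.Duration {
	if retry < 1 {
		retry = 1
	}
	delay := baseRetryDelay << (retry - 1)
	// The second check guards against overflow for very large retry counts.
	if delay > maxRetryDelay || delay <= 0 {
		delay = maxRetryDelay
	}
	jitter := time.Duration(rand.Int63n(int64(maxJitter)))
	return delay + jitter
}

func main() {
	for retry := 1; retry <= maxRetries; retry++ {
		fmt.Printf("retry %d: wait %v\n", retry, retryDelay(retry))
	}
}
```

Adding the jitter on top of the capped delay keeps every wait within one second of the nominal schedule, which is consistent with the 2.322s and 4.028s values in the PR's sample logs.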

2. Smart Error Classification

  • Retryable network errors: connection refused, timeout, DNS failures, wsarecv errors
  • Retryable HTTP errors: 5xx server errors, 408 Request Timeout, 429 Too Many Requests
  • Non-retryable errors: all other 4xx client errors (authentication, not found, etc.)

3. Comprehensive Coverage
All HTTP operations now include retry logic:

  • Beacon requests (GetBeaconBytes)
  • Payload downloads (GetPayloadBytes)
  • File uploads (UploadFileBytes)
  • Execution result reporting (SendExecutionResults)

4. Enhanced Logging

  • Detailed retry attempt logging: [!] HTTP request failed (attempt 1/4): connection refused. Retrying in 2.322s
  • Success after retry: [+] HTTP request succeeded on attempt 3
  • Clear failure reporting after exhausting retries

Testing

The implementation was tested against a mock server that simulates network failures. A run in which the first two requests return status 500 produces:

[!] HTTP request returned status 500 (attempt 1/4). Retrying in 2.322s
[!] HTTP request returned status 500 (attempt 2/4). Retrying in 4.028s
[+] HTTP request succeeded on attempt 3
[+] Beacon (HTTP): ALIVE

Impact

  • Before: Network errors caused permanent operation freezing with no indication of failure
  • After: Automatic recovery from temporary network issues with detailed logging
  • Benefit: Operations complete successfully despite network interference from Zscaler, corporate proxies, or unstable connections

This resolves the core issue where network instability would cause sandcat operations to freeze indefinitely, improving overall agent reliability in enterprise environments with network security tools.

Warning

Firewall rules blocked me from connecting to the following address:

  • 224.0.0.251

Original prompt

This section details the original issue you should resolve.

<issue_title>Sandcat network resiliency </issue_title>
<issue_description>Describe the bug
When the sandcat agent is running on a device with Zscaler active, we can get occasional network errors, which stop later command processing, but do not result in any data returned to the console indicating a problem / apparent retries.

To Reproduce
Run the sandcat agent (I'm using a slightly updated version of 5.0.0) in verbose mode.
Have Zscaler ZIA enabled. I believe other tools that interfere with network traffic would have a similar impact.
Execute an operation.

Some actions (I have not yet found a commonality) result in errors like the following:
Failed to decode HTTP response: illegal base64 data at input byte 0
or similarly
Failed to perform HTTP request: Post "https://xxxxx/beacon": read tcp 10.XX.XXX.XXX:58794->XXX.XXX.XX.X:443: wsarecv: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.

After this occurs, we return to a status of receiving [+] Beacon (HTTP): ALIVE messages, and the agent shows as live within the Caldera UI. However, the step associated with this never errors out / retries in the Caldera UI, resulting in the test 'freezing' while waiting for a status that will never come until the operation times out.

The screenshot below was taken well over 10 minutes after the last successful action listed in the operation UI.

Expected behavior
The agent should either retry the failed operation, or at a minimum have some kind of status data returned denoting a failure.

Screenshots
image

Desktop (please complete the following information):
Latest version of Chrome / Edge in use, but not really relevant since it's impacting the sandcat agent.
Sandcat agent tested on current updates for Win 10 / Win 11.

</issue_description>

Comments on the Issue (you are @copilot in this section)

Fixes #440


Copilot AI and others added 3 commits September 29, 2025 23:24
Co-authored-by: deacon-mp <61169193+deacon-mp@users.noreply.github.com>
…iency

Co-authored-by: deacon-mp <61169193+deacon-mp@users.noreply.github.com>
Co-authored-by: deacon-mp <61169193+deacon-mp@users.noreply.github.com>
@sonarqubecloud

Quality Gate failed

Failed conditions
1 Security Hotspot

See analysis details on SonarQube Cloud

Copilot AI changed the title from "[WIP] Sandcat network resiliency" to "Add HTTP retry logic with exponential backoff for network resiliency against Zscaler and proxy interference" Sep 29, 2025
Copilot AI requested a review from deacon-mp September 29, 2025 23:34
Copilot finished work on behalf of deacon-mp September 29, 2025 23:34
@deacon-mp deacon-mp requested review from Copilot and uruwhy October 6, 2025 22:51

Copilot AI left a comment


Pull Request Overview

This PR implements comprehensive HTTP retry logic with exponential backoff to resolve network resiliency issues with the sandcat agent when running in environments with Zscaler or similar network interference tools.

Key changes include:

  • Addition of retry configuration with a maximum of 3 retries and exponential backoff (2s → 4s → 8s with jitter)
  • Smart error classification to distinguish between retryable and non-retryable errors
  • Implementation of retry logic across all HTTP operations (beacon, payload downloads, file uploads, execution results)

Reviewed Changes

Copilot reviewed 2 out of 3 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| gocat/contact/api.go | Core implementation of HTTP retry logic with exponential backoff, error classification functions, and retry loops for all HTTP operations |
| gocat/contact/contact_test.go | Test cases for retry functionality including status code classification, error handling, and delay calculation |


requestBody := bytes.Buffer{}
contentType, err := createUploadForm(&requestBody, data, uploadName)
if err != nil {
	return err

Copilot AI Oct 6, 2025


This line was changed from return nil to return err, but the function signature indicates it should return an error. However, the original return nil suggests this might have been intended to return a nil error on success, not the error itself.

	delay = maxRetryDelay
}
// Add jitter to prevent thundering herd
jitter := time.Duration(rand.Intn(1000)) * time.Millisecond

Copilot AI Oct 6, 2025


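Using math/rand without seeding produces a predictable sequence on Go versions before 1.20 (from Go 1.20 on, the global generator is seeded randomly at startup and rand.Seed is deprecated). If the agent must build with an older toolchain, consider seeding math/rand with rand.Seed(time.Now().UnixNano()), or using crypto/rand, to ensure different jitter patterns across agent instances.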
