
Conversation

@jhelwig jhelwig commented Nov 26, 2025

Problem

S3 throttling happens frequently in production, generating three log messages per throttle event: "dual write failed", "enqueued for retry", and "retry successful". This operational noise makes it difficult to spot genuine problems in production logs and prevents creating useful retry-queue alerts without constant false alarms. The log volume also drives Honeycomb rate limiting and adds noise there.

The current system writes to S3 as fast as possible and handles throttling reactively. The dual-write implementation uses tokio::join! to write to both PostgreSQL and S3 in parallel. When S3 throttling occurs, the native SDK retry logic blocks the entire write operation, making PostgreSQL writes wait for S3's exponential backoff to complete. Since S3 is almost always the slower backend during capacity ramp-up, this significantly degrades layer cache write performance.
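To make the coupling concrete, here is a minimal, self-contained sketch (hypothetical stand-in functions, not the actual layer-cache code) of how `tokio::join!` pins the dual write to the slower S3 side:

```rust
use std::time::Duration;

// Hypothetical stand-ins for the real PostgreSQL and S3 write futures.
async fn pg_write() -> Result<(), String> {
    tokio::time::sleep(Duration::from_millis(5)).await; // completes in milliseconds
    Ok(())
}

async fn s3_write() -> Result<(), String> {
    // SDK-level retry of a throttled request can take multiple seconds.
    tokio::time::sleep(Duration::from_secs(4)).await;
    Err("503 SlowDown".to_string())
}

#[tokio::main]
async fn main() {
    // `tokio::join!` resolves only once *both* futures finish, so the overall
    // dual write is pinned to the slower (S3) side even though the PostgreSQL
    // write finished almost immediately.
    let (pg, s3) = tokio::join!(pg_write(), s3_write());
    println!("pg: {pg:?}, s3: {s3:?}");
}
```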

Impact of Current State

  • Performance degradation: All layer cache writes slowed to S3's speed (slowest backend) due to SDK retry blocking. PostgreSQL writes that could complete in milliseconds wait for S3's multi-second retry cycles.
  • Limited observability: No visibility into S3 backoff state or queue depth for meaningful alerts
  • Reactive approach: Constantly hitting S3 limits rather than writing at sustainable pace

Solution

This PR shifts from "write as fast as possible, handle failures reactively" to "write at sustainable pace with adaptive rate limiting." It introduces three new components:

S3WriteQueue: Persistent disk-based queue using Postcard serialization and ULID ordering for chronological processing. All S3 writes are atomically persisted to disk before acknowledgment, ensuring no data loss even during crashes. Non-retryable errors (serialization failures from corrupted data) are moved to a dead letter queue to preserve evidence while preventing corrupted writes from blocking queue processing.
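A simplified sketch of the enqueue path, using the `ulid` and `postcard` crates as described; the `QueuedWrite` fields and the exact file layout are assumptions, not the PR's actual code:

```rust
use std::{fs, path::PathBuf};

use serde::{Deserialize, Serialize};
use ulid::Ulid;

#[derive(Serialize, Deserialize)]
struct QueuedWrite {
    key: String,
    value: Vec<u8>,
}

struct S3WriteQueue {
    queue_dir: PathBuf,
}

impl S3WriteQueue {
    /// Persist a write to disk before acknowledging it. The ULID filename keeps
    /// entries in chronological order; writing to a temp file and renaming makes
    /// the enqueue atomic, so a crash can never expose a half-written entry.
    fn enqueue(&self, write: &QueuedWrite) -> std::io::Result<Ulid> {
        let ulid = Ulid::new();
        let bytes = postcard::to_allocvec(write)
            .map_err(|e| std::io::Error::new(std::io::ErrorKind::InvalidData, e.to_string()))?;

        let temp_path = self.queue_dir.join(format!("{ulid}.tmp"));
        let final_path = self.queue_dir.join(format!("{ulid}.pending"));
        fs::write(&temp_path, &bytes)?;
        fs::rename(&temp_path, &final_path)?; // atomic on the same filesystem
        Ok(ulid)
    }
}
```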

RateLimiter: Adaptive rate limiting that adjusts delay between S3 writes based on throttling responses. Increases delay on throttling (503), decreases delay after consecutive successes. Uses learning rate adaptation - growing on repeated throttling (larger adjustment steps to reduce request rate quickly), shrinking during recovery (smaller steps to validate stability at each rate level before ramping up).

S3QueueProcessor: Background task that dequeues oldest writes, applies backoff delays, and writes to S3. Separate S3 clients with different retry configurations: reads use SDK retry for resilience, writes use disabled retry because the queue provides application-level retry (prevents double-retry and timeout multiplication).
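A rough sketch of the processing loop. The `dequeue_oldest`, `remove`, `current_delay`, `record_success`, `record_throttle`, and `is_throttle` helpers are assumptions based on the description above, and the `aws_sdk_s3` calls are illustrative rather than copied from the PR:

```rust
use aws_sdk_s3::primitives::ByteStream;

async fn run_processor(
    queue: S3WriteQueue,
    mut limiter: RateLimiter,
    s3: aws_sdk_s3::Client,
    bucket: String,
) {
    loop {
        // Respect the current adaptive delay before touching S3 at all.
        tokio::time::sleep(limiter.current_delay()).await;

        let Some((ulid, write)) = queue.dequeue_oldest() else {
            tokio::time::sleep(std::time::Duration::from_millis(100)).await;
            continue;
        };

        let result = s3
            .put_object()
            .bucket(&bucket)
            .key(&write.key)
            .body(ByteStream::from(write.value.clone()))
            .send()
            .await;

        match result {
            Ok(_) => {
                limiter.record_success();
                queue.remove(ulid); // delete the on-disk entry only after S3 accepts the write
            }
            Err(err) if is_throttle(&err) => limiter.record_throttle(), // entry stays queued
            Err(_) => { /* transient error: leave queued, retried after backoff */ }
        }
    }
}
```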

After this change:

  • PostgreSQL writes proceed at full speed (not blocked by S3 throttling)
  • All S3 writes go through persistent queue with background processing
  • Throttling logged at DEBUG level (not WARN) since it's handled automatically
  • Operators can graph backoff state and queue depth, create meaningful alerts
  • System writes at sustainable pace rather than constantly hitting limits

Design Rationale

Why background queue vs fixing SDK retry blocking?

The SDK retry blocking is fundamental to how tokio::join! works - it waits for all futures to complete. We could disable SDK retry entirely, but then we lose retry behavior for reads and need to handle all transient S3 errors ourselves. The queue-based approach gives us both: SDK retry for resilient reads, application-level retry for writes without blocking PostgreSQL.

Why are all writes persisted to disk (even when queue is empty)?

An immediate write could receive a 503 from S3 even when the queue is empty. If the process crashes between receiving the 503 and persisting to the queue, the write is lost permanently with no recovery path. Disk persistence before acknowledgment ensures zero data loss, even during crashes. The performance overhead of hitting disk is acceptable because durability guarantees are non-negotiable for cache persistence.

Why adaptive learning rate approach?

Standard exponential backoff uses fixed multipliers (e.g., multiply by M on throttle, divide by N on success), which causes indefinite oscillation around the throttle threshold: each recovery overshoots and triggers throttling again because the adjustment size never changes.

The adaptive approach adjusts the step size based on system response: growing when repeatedly throttled (larger steps needed to escape throttle zone), shrinking after successful reductions (smaller steps to find sustainable rate precisely). This reduces repeated throttle events during recovery by validating stability at each rate level before making the next adjustment.
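As a concrete illustration, here is a minimal sketch of that adaptive step-size logic. The field and method names are assumptions drawn from the configuration parameters described later in this PR, not the exact implementation:

```rust
/// Assumed configuration shape; field names follow the RateLimitConfig
/// parameters listed in this PR's commit messages.
struct RateLimitConfig {
    adjustment_size_ms: f64,
    initial_learning_rate: f64,
    min_learning_rate: f64,
    max_learning_rate: f64,
    learning_rate_growth: f64, // > 1.0
    learning_rate_shrink: f64, // < 1.0
    successes_before_reduction: u32,
    zeno_threshold_ms: f64,
}

struct RateLimiter {
    cfg: RateLimitConfig,
    current_backoff_ms: f64,
    learning_rate: f64,
    consecutive_successes: u32,
}

impl RateLimiter {
    fn record_throttle(&mut self) {
        // Repeated throttling grows the step size so we escape the throttle zone quickly.
        self.learning_rate =
            (self.learning_rate * self.cfg.learning_rate_growth).min(self.cfg.max_learning_rate);
        self.current_backoff_ms += self.learning_rate * self.cfg.adjustment_size_ms;
        self.consecutive_successes = 0;
    }

    fn record_success(&mut self) {
        self.consecutive_successes += 1;
        if self.consecutive_successes < self.cfg.successes_before_reduction {
            return; // validate stability at this rate before stepping down
        }
        self.consecutive_successes = 0;
        // Successful reductions shrink the step size so we approach the sustainable
        // rate without overshooting back into throttling.
        self.learning_rate =
            (self.learning_rate * self.cfg.learning_rate_shrink).max(self.cfg.min_learning_rate);
        self.current_backoff_ms =
            (self.current_backoff_ms - self.learning_rate * self.cfg.adjustment_size_ms).max(0.0);
        if self.current_backoff_ms < self.cfg.zeno_threshold_ms {
            self.current_backoff_ms = 0.0; // small enough: jump straight back to full speed
        }
    }

    fn current_delay(&self) -> std::time::Duration {
        std::time::Duration::from_millis(self.current_backoff_ms as u64)
    }
}
```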

Why separate S3 clients with different retry configs?

S3Layer reads need SDK retry for resilience. Processor writes need it disabled to prevent double-retry and timeout multiplication (the queue provides application-level retry). Using the same client would force choosing one behavior over the other.
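A sketch of how the two clients might be built from one shared config using the AWS SDK's `RetryConfig`; this is illustrative, not the PR's exact construction code:

```rust
use aws_config::retry::RetryConfig;

async fn build_clients() -> (aws_sdk_s3::Client, aws_sdk_s3::Client) {
    let base = aws_config::load_from_env().await;

    // Reads: keep the SDK's standard retry for resilience.
    let read_conf = aws_sdk_s3::config::Builder::from(&base)
        .retry_config(RetryConfig::standard())
        .build();

    // Writes: the persistent queue already retries, so disable SDK retry to avoid
    // double-retry and timeout multiplication.
    let write_conf = aws_sdk_s3::config::Builder::from(&base)
        .retry_config(RetryConfig::disabled())
        .build();

    (
        aws_sdk_s3::Client::from_conf(read_conf),
        aws_sdk_s3::Client::from_conf(write_conf),
    )
}
```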

Why transform keys before queueing?

The processor doesn't need to know about transformation strategies or key prefixes; it just uses whatever key is in the queue. This simplification eliminates the processor's dependency on S3Layer's internal transformation logic.

Why ephemeral rate limiter state?

Simpler implementation, no persistence overhead. Rate limiter rediscovers appropriate limits naturally during processing after restart. Queue persistence ensures durability; rate limiter just optimizes request timing.

Why DEBUG logging for throttling vs WARN?

Different S3 errors require different handling strategies. Throttling responses (503) are expected operational conditions that the rate limiter handles automatically - they're S3 telling us we're writing too fast, not a problem requiring human intervention. Logging at WARN creates operational noise that obscures genuine problems. Serialization errors indicate corrupted data and go to ERROR (requires investigation). Everything else represents transient failures logged at WARN with backoff retry.

Deployment Notes

Behavioral Changes

  • PostgreSQL write performance improves: PostgreSQL writes no longer blocked by S3 throttling. Write latency returns to PostgreSQL speeds (milliseconds) rather than being dominated by S3 SDK retry cycles (seconds).
  • All S3 writes hit disk before acknowledgment: Durability guarantee means every write persists to queue on disk before returning success. No fast path for empty queue.
  • S3 throttling logged at DEBUG: Throttling (503) is expected operational condition handled automatically by rate limiter, not a problem requiring human intervention. Serialization errors remain ERROR, transient failures remain WARN.
  • Queue persists across restarts: Pending writes survive service restarts. Rate limiter state is ephemeral and rediscovers appropriate rate naturally during processing.

Monitoring

New metrics for alerting and graphing:

  • s3_write_queue_depth: Current pending writes (alert on sustained growth)
  • s3_write_backoff_ms: Current backoff delay (visualize rate limiting)
  • s3_write_attempts_total: Counter by result (success/throttle/error_*)
  • s3_write_duration_ms: End-to-end latency from enqueue to completion
  • s3_dlq_depth: Dead letter queue depth for corrupted writes
  • layer_cache_read_fallback: S3→PostgreSQL fallback rate

Grafana Dashboard

Complete reorganization of layer cache metrics dashboard (42 panels) into 5 workflow-based sections with improved readability and new operational metrics:

  1. OVERVIEW & HEALTH: System health scorecard with error rates, S3/retry queue depths (color-coded thresholds), write throughput balance, and cache resolution metrics
  2. READ PATH PERFORMANCE: Request latency, backend latency comparison, cache miss rates, S3→PostgreSQL fallback tracking
  3. WRITE PATH - S3 PERSISTENCE: Queue depth, backoff state, success rate, latency distribution, error breakdown by category, dead letter queue depth
  4. WRITE PATH - POSTGRES PERSISTENCE: Persister operations, persistence latency, retry queue depth and rates
  5. EVICTION OPERATIONS: Eviction rates, failures, latency, memory-only evictions

Dashboard improvements:

  • All p50/p95/p99 panels show exactly 3 data series for readability (using panel repeat and aggregation to eliminate visual noise)
  • Template variables use union queries to dynamically discover all backends (foyer, postgres, s3) and services (including read-only edda)
  • Added 8 new operational metrics panels (health scorecard, fallback tracking, error breakdown, DLQ depth, throughput balance, memory-only evictions, overall miss rate)
  • Removed 3 redundant panels (data available in other visualizations)

The graphs in the screenshot below were generated in my dev environment by sticking a simple HTTP proxy between the services and versitygw. The proxy limited ObjectPut to 5 requests/second globally and returned a 503 when requests exceeded that rate.

[Screenshot: layer cache S3 write queue depth and backoff graphs from the dev environment]

Migration Path

  • Queue files are created on first S3 write: {cache_name}_s3_queue/{ULID}.pending
  • Dead letter queue: {cache_name}_s3_queue/dead_letter/{ULID}.corrupted

Services using PostgresOnly mode are unaffected. Services using S3-enabled modes (DualWrite, S3Primary, S3Only) will create queue directories on startup.

Development Environment

Updated the versitygw binary path in docker-compose from /usr/local/bin/versitygw to /app/versitygw to match the current container image (the container was exiting with code 127 after init-buckets.sh completed).

@github-actions github-actions bot added A-si-settings Area: Backend service settings management [Rust] A-sdf Area: Primary backend API service [Rust] A-veritech Area: Task execution backend service [Rust] A-cyclone Area: Function execution engine [Rust] A-dal A-bytes-lines-codec A-config-file A-si-test-macros A-telemetry-rs A-dal-test A-si-data-nats labels Nov 26, 2025

github-actions bot commented Nov 26, 2025

Dependency Review

✅ No vulnerabilities or OpenSSF Scorecard issues found.

Scanned Files

None

@jhelwig jhelwig force-pushed the jhelwig/reduce-layer-cache-throttling-noise branch 6 times, most recently from 64c3f23 to d525e65 Compare November 30, 2025 20:56
@github-actions github-actions bot added the A-ci Area: CI configuration files and scripts label Nov 30, 2025

jhelwig commented Dec 1, 2025

/try

github-actions bot commented Dec 1, 2025

Okay, starting a try! I'll update this comment once it's running...
🚀 Try running here! 🚀

Implements adaptive rate limiting using a gradient descent algorithm to smoothly
converge on an optimal S3 request rate without the indefinite oscillation that
MIMD (multiplicative increase, multiplicative decrease) exhibits.

# Algorithm

On throttle: backoff increases by `learning_rate * adjustment_size_ms`
On success: after N consecutive successes, backoff decreases by the same formula

Learning rate adapts:
- Grows on repeated throttling (larger steps to escape throttle zone)
- Shrinks during recovery (smaller steps for precision near optimal)

# Convergence

Simulations show 15-25 iterations to converge with ±7.5% oscillation around
target rate, vs MIMD's indefinite ±33% oscillation.

# Configuration

All parameters configurable via RateLimitConfig:
- adjustment_size_ms: Base step size for adjustments
- initial_learning_rate: Starting learning rate when throttled
- min/max_learning_rate: Bounds for adaptation
- learning_rate_growth: Multiplier when throttled (>1.0)
- learning_rate_shrink: Multiplier when reducing (<1.0)
- successes_before_reduction: Validation threshold
- zeno_threshold_ms: Below this, backoff jumps to zero

Includes comprehensive validation to catch misconfigurations at load time.
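A load-time validation pass of that shape might look roughly like the following, continuing the hypothetical `RateLimitConfig` sketched in the PR description above (field names are assumptions, not the actual code):

```rust
impl RateLimitConfig {
    /// Reject configurations that would break convergence at load time, rather
    /// than after the limiter starts misbehaving in production.
    fn validate(&self) -> Result<(), String> {
        if self.learning_rate_growth <= 1.0 {
            return Err("learning_rate_growth must be > 1.0".into());
        }
        if !(0.0 < self.learning_rate_shrink && self.learning_rate_shrink < 1.0) {
            return Err("learning_rate_shrink must be in (0.0, 1.0)".into());
        }
        if self.min_learning_rate > self.max_learning_rate {
            return Err("min_learning_rate must be <= max_learning_rate".into());
        }
        if !(self.min_learning_rate..=self.max_learning_rate)
            .contains(&self.initial_learning_rate)
        {
            return Err("initial_learning_rate must be within [min, max]".into());
        }
        if self.successes_before_reduction == 0 {
            return Err("successes_before_reduction must be at least 1".into());
        }
        Ok(())
    }
}
```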
…ring

We cannot lose S3 writes under any circumstances - it would be catastrophic
data loss. Memory-only queues don't provide sufficient guarantees because the
AWS SDK keeps retry data in memory for minutes, and any crash during that
window loses the write permanently.

This commit implements disk-based persistence where writes are atomically
persisted before being acknowledged. We reuse the retry queue's Postcard
serialization format and ULID-based filenames for chronological ordering
because it's proven reliable in production.

Non-retryable errors (serialization failures from corrupted data) are moved to
a dead letter queue. This preserves evidence for debugging while preventing the
corrupted write from blocking queue processing indefinitely.

This design prepares for eventual PostgreSQL layer removal. S3Layer will
operate as the sole persistence layer with its own infrastructure rather than
depending on shared retry queue mechanisms.

Different S3 errors require different handling strategies. Throttling responses
(503) are expected operational conditions that should trigger backoff and be
logged at DEBUG level. Serialization errors indicate corrupted data that cannot
be retried and must go to the dead letter queue. Everything else represents
transient failures that should retry with backoff.

The key insight is that throttling isn't a problem - it's S3 telling us we're
writing too fast, and the rate limiter handles this automatically. Treating it
as WARN-level creates operational noise that obscures genuine problems.

We extract error classification to a module-level function so both S3Layer
(for read operations) and S3QueueProcessor (for writes) use consistent logic.
Classification uses HTTP status codes primarily, with fallback to error message
patterns for S3-compatible services that may not use standard status codes.
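A simplified, SDK-agnostic sketch of that classification logic; the real module-level function works from the AWS SDK error types, but the decision tree has the same shape (names here are illustrative):

```rust
#[derive(Debug, PartialEq)]
enum S3ErrorKind {
    /// 503 / SlowDown: expected, handled by the rate limiter, logged at DEBUG.
    Throttled,
    /// Corrupted/unserializable data: cannot be retried, goes to the DLQ, logged at ERROR.
    NonRetryable,
    /// Anything else: transient, retried with backoff, logged at WARN.
    Transient,
}

fn classify_s3_error(http_status: Option<u16>, message: &str) -> S3ErrorKind {
    // Prefer the HTTP status code when we have one.
    if http_status == Some(503) {
        return S3ErrorKind::Throttled;
    }
    // Fall back to message patterns for S3-compatible services that don't use
    // standard status codes.
    let lower = message.to_ascii_lowercase();
    if lower.contains("slowdown") || lower.contains("throttl") {
        S3ErrorKind::Throttled
    } else if lower.contains("serialization") || lower.contains("deserialize") {
        S3ErrorKind::NonRetryable
    } else {
        S3ErrorKind::Transient
    }
}
```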
@jhelwig jhelwig force-pushed the jhelwig/reduce-layer-cache-throttling-noise branch from 92b78aa to 10b462e Compare December 1, 2025 19:45

jhelwig commented Dec 1, 2025

/try

github-actions bot commented Dec 1, 2025

Okay, starting a try! I'll update this comment once it's running...
🚀 Try running here! 🚀

…ueue into S3Layer

The core architectural decision here is that ALL S3 writes go through the
persistent queue with no fast path. This might seem excessive when the queue
is empty and we're not rate-limited, but consider what happens if we write
immediately: we could receive a 503 response, and if the process crashes
before we persist the write to the queue, it's lost forever.

We accept the trade-off of every write hitting disk because durability is more
important than write performance. This becomes critical as we prepare to remove
PostgreSQL entirely - S3Layer must operate as a self-sufficient persistence
layer without relying on shared retry infrastructure.

The processor runs as a background task that dequeues the oldest writes (ULID
ordering preserves chronology), applies the current backoff delay, and writes
directly to S3. We use separate S3 clients with different retry configurations:
the S3Layer client has SDK retry enabled for resilient reads, while the
processor client has retry disabled because the queue itself provides
application-level retry. This prevents double-retry and timeout multiplication.

Keys are transformed at the API boundary before queueing. The processor doesn't
know about transformation strategies or key prefixes - it just uses whatever
key is in the queue. This architectural simplification eliminates processor
dependency on S3Layer's internal transformation logic.

On shutdown, we complete any in-flight write and exit immediately without
draining the queue. The queue persists across restarts, and the rate limiter
rediscovers appropriate limits naturally during processing. Keeping the rate
limiter state ephemeral simplifies the implementation without compromising
durability.

Moving throttling detection from WARN to DEBUG logs reduces noise, but
operators still need visibility into system behavior. Metrics provide this
visibility without cluttering logs - operators can graph backoff trends over
time, create alerts on sustained high backoff without false alarms, and
understand how queue depth correlates with backoff state.

The four panels show queue depth (detecting backpressure), backoff state
(visualizing rate limiting in action), success rate (tracking write outcomes
over time), and latency distribution (end-to-end timing from enqueue through
successful write). Together these give operators a complete picture of S3
write health without requiring log analysis.

The architecture shift from "write fast, handle failures reactively" to
"write at sustainable pace proactively" isn't obvious from code alone.
Operators need to understand how the system behaves, why we made these
trade-offs, and how to interpret what they see in production.

The README provides an operational guide: how to configure rate limiting for
different workloads, which metrics to monitor and when to alert, where queue
files live on disk, and what happens during startup and shutdown. This context
helps operators make informed decisions about tuning and troubleshooting.

Inline documentation explains the completed architecture after refactoring.
We use separate S3 clients because reads need SDK retry for resilience while
writes need it disabled to prevent double-retry. Keys are transformed before
queueing so the processor doesn't need to know about transformation strategies.
The processor uses keys directly from the queue as a deliberate architectural
simplification.

The versitygw binary moved from /usr/local/bin/versitygw to /app/versitygw
in the container image, but the docker-compose configuration wasn't updated.
This caused the container to exit with code 127 (command not found) after
the init-buckets.sh script completed.

Update the exec path to /app/versitygw to match the current container
image structure.

Complete overhaul of the layer cache metrics dashboard to improve
readability, add missing operational metrics, and organize panels by
workflow.

Template Variables:
- Add operation template for filtering persistence latency metrics
- Convert backend template from hardcoded to dynamic query discovering
  all backends (foyer, postgres, s3) automatically
- Expand service template to include read-only services (edda) using
  union query across multiple metrics

Panel Organization:
Restructured 42 panels into 5 logical sections matching operational
workflows:

1. OVERVIEW & HEALTH - System health scorecard with error rates, queue
   depths, resolution rates, and write throughput balance
2. READ PATH PERFORMANCE - Request latency, backend latency comparison,
   cache miss rates, fallback tracking
3. WRITE PATH - S3 PERSISTENCE - Queue depth, backoff state, write
   throughput, error breakdown, dead letter queue
4. WRITE PATH - POSTGRES PERSISTENCE - Persister operations,
   persistence latency, retry queue depth and rates
5. EVICTION OPERATIONS - Eviction rates, failures, latency, and
   memory-only evictions

Readability Improvements:
- All p50/p95/p99 panels now show exactly 3 data series using panel
  repeat and aggregation
- Panel 26 (Persistence Latency): Repeats by operation, aggregates
  other dimensions
- Panel 27 (Backend Read Latency): Aggregates across caches
- Panel 28 (Request Latency): Repeats by cache
- Panel 29 (Backend Hit Latency): Repeats by cache, shows all backends
  together, removed miss series

New Metrics (8 panels):
- System Health Scorecard: Overall error rate, S3 queue depth, retry
  queue depth with color-coded thresholds
- Fallback Tracking: S3→PostgreSQL fallback rate indicating data
  availability issues
- S3 Write Error Breakdown: Stacked timeseries by error category
- S3 Dead Letter Queue Depth: Corrupted write tracking
- Write Throughput Balance: Compares S3 vs PostgreSQL write rates for
  dual-write monitoring
- Memory-Only Evictions: Tracks Foyer memory evictions without disk
  persistence
- Overall Cache Miss Rate: Single aggregated panel showing complete
  misses per cache

Removed Redundant Panels (3):
- Retry Queue Success vs Failure Ratio (data visible in other panels)
- S3 Write Failures by Error Category (replaced by comprehensive
  breakdown)
- Persister Retry Operations (superseded by specific retry queue
  panels)

Technical Details:
- Overall cache miss uses layer_cache_request_latency_ms metric
  (verified in layer_cache.rs:347-351) tracking complete misses across
  all layers
- All backends explicitly report both hit and miss metrics (verified in
  layer_cache.rs:154-308)
- Template union queries ensure all backends and services appear
  dynamically
@jhelwig jhelwig force-pushed the jhelwig/reduce-layer-cache-throttling-noise branch from 10b462e to 808cf48 Compare December 1, 2025 19:58

@zacharyhamm zacharyhamm left a comment

This code looks pretty solid (although it's giving Claude). A few comments; will look again when this is moved out of draft. I'm curious: is it not possible to do S3 writes in parallel for better throughput? Although I suppose we're going to be doing lots of parallel writes for every workspace in the whole service, so maybe a single queue for each layer cache makes sense.

/// attempting the next operation. Returns `Duration::ZERO` when no delay
/// is needed.
pub fn current_delay(&self) -> Duration {
    Duration::from_millis(self.current_backoff_ms as u64)
}
Contributor

An f64 can represent values both larger and smaller than a u64 can hold (overflow/underflow). Do we care here? Not sure.

Contributor Author

Max u64 milliseconds would be 5.845421e+8 years, so we should be staying well away from any over-/under-flow risk due to the clamping elsewhere.


// Write atomically (write to temp, then rename)
let temp_path = self.queue_dir.join(format!("{ulid}.tmp"));
fs::write(&temp_path, &bytes)?;
Contributor

Do we care about fsync here? It will probably be a perf hit

Contributor Author

I did have an fsync here at one point while working on it, and I keep flipping back and forth on it. The writes come through pretty quickly on an active system, so doing an fsync on every write will add up very quickly. It feels like what we'd want is an fsync timer (make sure there is one at least every N seconds) instead of one on every individual write.

Contributor

Eh, we can do that just by tuning the kernel (there's a dirty page expire sysctl flag and a few other things). Linux by default will ensure pages older than 30 seconds are written to disk

Contributor

This really makes me wonder if an sqlite3 database isn't a better solution than raw disk files for managing the queues. We can think about it in the future. This likely means a hard crash will leave some number of corrupted entries.

Comment on lines +126 to +132
if let Err(dlq_err) = self.move_corrupted_file_to_dlq(&path, ulid) {
    tracing::error!(
        "Failed to move corrupted file {} to DLQ: {}",
        path.display(),
        dlq_err
    );
}
Contributor

If moving the corrupted file to the DLQ fails, does this mean we'll encounter it again next time and attempt to process it again? Not sure of a simple solution to avoid this here, but it seems worth calling out.

Contributor Author

If the move fails, then we've likely got bigger FS issues, as these should be on the same filesystem and the move is an fs::rename, so it should just be inode updates, not "actual" data writing. It should also leave the bad file in the original place (part of the "normal" queue), unless there are catastrophic FS issues.

#[cfg(test)]
mod tests {
use super::*;
impl Drop for S3Layer {
Contributor

I'm curious how doing the shutdown this way will interact with the graceful shutdown handlers. Everywhere else we block with task trackers and cancellation token. There's one for layer db already.

Is there a risk we won't finish in-progress write operations before the termination completes?

Contributor Author

There's always a risk, but it should be safe if that happens, since we don't remove the queue entry until after the write is successful. The possibly-not-completed write would be picked up on service restart, and attempted again.

jhelwig commented Dec 1, 2025

I'm curious: is it not possible to do s3 writes in parallel for better throughput? Although I suppose we're going to be doing lots of parallel writes for every workspace in the whole service so maybe a single queue for each layer cache makes sense

Next steps to increase throughput that I can think of:

  • Rate-limited queue per prefix (S3 scales up write speeds based on the unique key prefix)
  • Rate-limited parallel writes per queue

These would mainly come into play after S3 has scaled up to the point where we're no longer getting any 503s with the serial, single, per-cache queueing, though.

@jhelwig jhelwig marked this pull request as ready for review December 1, 2025 22:25

jhelwig commented Dec 1, 2025

/try

github-actions bot commented Dec 1, 2025

Okay, starting a try! I'll update this comment once it's running...
🚀 Try running here! 🚀

@zacharyhamm zacharyhamm left a comment

let's see how it works in the wild

@jhelwig jhelwig added this pull request to the merge queue Dec 1, 2025
@jhelwig jhelwig removed this pull request from the merge queue due to a manual request Dec 1, 2025