
Conversation

@jhelwig jhelwig commented Nov 26, 2025

Problem

S3 throttling happens frequently in production, generating three log messages per throttle event: "dual write failed", "enqueued for retry", and "retry successful". This operational noise makes it difficult to spot genuine problems in production logs and prevents creating useful retry-queue alerts without constant false alarms. The log volume also drives Honeycomb rate limiting and adds noise there.

The current system writes to S3 as fast as possible and handles throttling reactively. The dual-write implementation uses tokio::join! to write to both PostgreSQL and S3 in parallel. When S3 throttling occurs, the native SDK retry logic blocks the entire write operation, making PostgreSQL writes wait for S3's exponential backoff to complete. Since S3 is almost always the slower backend during capacity ramp-up, this significantly degrades layer cache write performance.
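To make the coupling concrete, here is a minimal, self-contained sketch (hypothetical stand-in functions, not the actual layer-cache code) of how `tokio::join!` pins the dual write to the slower S3 side:

```rust
use std::time::Duration;

// Hypothetical stand-ins for the real PostgreSQL and S3 write futures.
async fn pg_write() -> Result<(), String> {
    tokio::time::sleep(Duration::from_millis(5)).await; // completes in milliseconds
    Ok(())
}

async fn s3_write() -> Result<(), String> {
    // SDK-level retry of a throttled request can take multiple seconds.
    tokio::time::sleep(Duration::from_secs(4)).await;
    Err("503 SlowDown".to_string())
}

#[tokio::main]
async fn main() {
    // `tokio::join!` resolves only once *both* futures finish, so the overall
    // dual write is pinned to the slower (S3) side even though the PostgreSQL
    // write finished almost immediately.
    let (pg, s3) = tokio::join!(pg_write(), s3_write());
    println!("pg: {pg:?}, s3: {s3:?}");
}
```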

Impact of Current State

  • Performance degradation: All layer cache writes slowed to S3's speed (slowest backend) due to SDK retry blocking. PostgreSQL writes that could complete in milliseconds wait for S3's multi-second retry cycles.
  • Limited observability: No visibility into S3 backoff state or queue depth for meaningful alerts
  • Reactive approach: Constantly hitting S3 limits rather than writing at sustainable pace

Solution

This PR shifts from "write as fast as possible, handle failures reactively" to "write at sustainable pace with adaptive rate limiting." It introduces three new components:

S3WriteQueue: Persistent disk-based queue using Postcard serialization and ULID ordering for chronological processing. All S3 writes are atomically persisted to disk before acknowledgment, ensuring no data loss even during crashes. Non-retryable errors (serialization failures from corrupted data) are moved to a dead letter queue to preserve evidence while preventing corrupted writes from blocking queue processing.
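A simplified sketch of the enqueue path, using the `ulid` and `postcard` crates as described; the `QueuedWrite` fields and the exact file layout are assumptions, not the PR's actual code:

```rust
use std::{fs, path::PathBuf};

use serde::{Deserialize, Serialize};
use ulid::Ulid;

#[derive(Serialize, Deserialize)]
struct QueuedWrite {
    key: String,
    value: Vec<u8>,
}

struct S3WriteQueue {
    queue_dir: PathBuf,
}

impl S3WriteQueue {
    /// Persist a write to disk before acknowledging it. The ULID filename keeps
    /// entries in chronological order; writing to a temp file and renaming makes
    /// the enqueue atomic, so a crash can never expose a half-written entry.
    fn enqueue(&self, write: &QueuedWrite) -> std::io::Result<Ulid> {
        let ulid = Ulid::new();
        let bytes = postcard::to_allocvec(write)
            .map_err(|e| std::io::Error::new(std::io::ErrorKind::InvalidData, e.to_string()))?;

        let temp_path = self.queue_dir.join(format!("{ulid}.tmp"));
        let final_path = self.queue_dir.join(format!("{ulid}.pending"));
        fs::write(&temp_path, &bytes)?;
        fs::rename(&temp_path, &final_path)?; // atomic on the same filesystem
        Ok(ulid)
    }
}
```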

RateLimiter: Adaptive rate limiting that adjusts delay between S3 writes based on throttling responses. Increases delay on throttling (503), decreases delay after consecutive successes. Uses learning rate adaptation - growing on repeated throttling (larger adjustment steps to reduce request rate quickly), shrinking during recovery (smaller steps to validate stability at each rate level before ramping up).

S3QueueProcessor: Background task that dequeues oldest writes, applies backoff delays, and writes to S3. Separate S3 clients with different retry configurations: reads use SDK retry for resilience, writes use disabled retry because the queue provides application-level retry (prevents double-retry and timeout multiplication).
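A rough sketch of the processing loop. The `dequeue_oldest`, `remove`, `current_delay`, `record_success`, `record_throttle`, and `is_throttle` helpers are assumptions based on the description above, and the `aws_sdk_s3` calls are illustrative rather than copied from the PR:

```rust
use aws_sdk_s3::primitives::ByteStream;

async fn run_processor(
    queue: S3WriteQueue,
    mut limiter: RateLimiter,
    s3: aws_sdk_s3::Client,
    bucket: String,
) {
    loop {
        // Respect the current adaptive delay before touching S3 at all.
        tokio::time::sleep(limiter.current_delay()).await;

        let Some((ulid, write)) = queue.dequeue_oldest() else {
            tokio::time::sleep(std::time::Duration::from_millis(100)).await;
            continue;
        };

        let result = s3
            .put_object()
            .bucket(&bucket)
            .key(&write.key)
            .body(ByteStream::from(write.value.clone()))
            .send()
            .await;

        match result {
            Ok(_) => {
                limiter.record_success();
                queue.remove(ulid); // delete the on-disk entry only after S3 accepts the write
            }
            Err(err) if is_throttle(&err) => limiter.record_throttle(), // entry stays queued
            Err(_) => { /* transient error: leave queued, retried after backoff */ }
        }
    }
}
```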

After this change:

  • PostgreSQL writes proceed at full speed (not blocked by S3 throttling)
  • All S3 writes go through persistent queue with background processing
  • Throttling logged at DEBUG level (not WARN) since it's handled automatically
  • Operators can graph backoff state and queue depth, create meaningful alerts
  • System writes at sustainable pace rather than constantly hitting limits

Design Rationale

Why background queue vs fixing SDK retry blocking?

The SDK retry blocking is fundamental to how tokio::join! works - it waits for all futures to complete. We could disable SDK retry entirely, but then we lose retry behavior for reads and need to handle all transient S3 errors ourselves. The queue-based approach gives us both: SDK retry for resilient reads, application-level retry for writes without blocking PostgreSQL.

Why are all writes persisted to disk (even when queue is empty)?

An immediate write could receive a 503 from S3 even when the queue is empty. If the process crashes between receiving the 503 and persisting to the queue, the write is lost permanently with no recovery path. Disk persistence before acknowledgment ensures zero data loss, even during crashes. The performance overhead of hitting disk is acceptable because durability guarantees are non-negotiable for cache persistence.

Why adaptive learning rate approach?

Standard exponential backoff uses fixed multipliers (e.g., multiply by M on throttle, divide by N on success), which causes indefinite oscillation around the throttle threshold: each recovery overshoots and triggers throttling again because the adjustment size never changes.

The adaptive approach adjusts the step size based on system response: growing when repeatedly throttled (larger steps needed to escape throttle zone), shrinking after successful reductions (smaller steps to find sustainable rate precisely). This reduces repeated throttle events during recovery by validating stability at each rate level before making the next adjustment.
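As a concrete illustration, here is a minimal sketch of that adaptive step-size logic. The field and method names are assumptions drawn from the configuration parameters described later in this PR, not the exact implementation:

```rust
/// Assumed configuration shape; field names follow the RateLimitConfig
/// parameters listed in this PR's commit messages.
struct RateLimitConfig {
    adjustment_size_ms: f64,
    initial_learning_rate: f64,
    min_learning_rate: f64,
    max_learning_rate: f64,
    learning_rate_growth: f64, // > 1.0
    learning_rate_shrink: f64, // < 1.0
    successes_before_reduction: u32,
    zeno_threshold_ms: f64,
}

struct RateLimiter {
    cfg: RateLimitConfig,
    current_backoff_ms: f64,
    learning_rate: f64,
    consecutive_successes: u32,
}

impl RateLimiter {
    fn record_throttle(&mut self) {
        // Repeated throttling grows the step size so we escape the throttle zone quickly.
        self.learning_rate =
            (self.learning_rate * self.cfg.learning_rate_growth).min(self.cfg.max_learning_rate);
        self.current_backoff_ms += self.learning_rate * self.cfg.adjustment_size_ms;
        self.consecutive_successes = 0;
    }

    fn record_success(&mut self) {
        self.consecutive_successes += 1;
        if self.consecutive_successes < self.cfg.successes_before_reduction {
            return; // validate stability at this rate before stepping down
        }
        self.consecutive_successes = 0;
        // Successful reductions shrink the step size so we approach the sustainable
        // rate without overshooting back into throttling.
        self.learning_rate =
            (self.learning_rate * self.cfg.learning_rate_shrink).max(self.cfg.min_learning_rate);
        self.current_backoff_ms =
            (self.current_backoff_ms - self.learning_rate * self.cfg.adjustment_size_ms).max(0.0);
        if self.current_backoff_ms < self.cfg.zeno_threshold_ms {
            self.current_backoff_ms = 0.0; // small enough: jump straight back to full speed
        }
    }

    fn current_delay(&self) -> std::time::Duration {
        std::time::Duration::from_millis(self.current_backoff_ms as u64)
    }
}
```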

Why separate S3 clients with different retry configs?

S3Layer reads need SDK retry for resilience. Processor writes need it disabled to prevent double-retry and timeout multiplication (the queue provides application-level retry). Using the same client would force choosing one behavior over the other.
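A sketch of how the two clients might be built from one shared config using the AWS SDK's `RetryConfig`; this is illustrative, not the PR's exact construction code:

```rust
use aws_config::retry::RetryConfig;

async fn build_clients() -> (aws_sdk_s3::Client, aws_sdk_s3::Client) {
    let base = aws_config::load_from_env().await;

    // Reads: keep the SDK's standard retry for resilience.
    let read_conf = aws_sdk_s3::config::Builder::from(&base)
        .retry_config(RetryConfig::standard())
        .build();

    // Writes: the persistent queue already retries, so disable SDK retry to avoid
    // double-retry and timeout multiplication.
    let write_conf = aws_sdk_s3::config::Builder::from(&base)
        .retry_config(RetryConfig::disabled())
        .build();

    (
        aws_sdk_s3::Client::from_conf(read_conf),
        aws_sdk_s3::Client::from_conf(write_conf),
    )
}
```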

Why transform keys before queueing?

The processor doesn't need to know about transformation strategies or key prefixes; it just uses whatever key is in the queue. This simplification eliminates the processor's dependency on S3Layer's internal transformation logic.

Why ephemeral rate limiter state?

Simpler implementation, no persistence overhead. Rate limiter rediscovers appropriate limits naturally during processing after restart. Queue persistence ensures durability; rate limiter just optimizes request timing.

Why DEBUG logging for throttling vs WARN?

Different S3 errors require different handling strategies. Throttling responses (503) are expected operational conditions that the rate limiter handles automatically - they're S3 telling us we're writing too fast, not a problem requiring human intervention. Logging at WARN creates operational noise that obscures genuine problems. Serialization errors indicate corrupted data and go to ERROR (requires investigation). Everything else represents transient failures logged at WARN with backoff retry.

Deployment Notes

Behavioral Changes

  • PostgreSQL write performance improves: PostgreSQL writes no longer blocked by S3 throttling. Write latency returns to PostgreSQL speeds (milliseconds) rather than being dominated by S3 SDK retry cycles (seconds).
  • All S3 writes hit disk before acknowledgment: Durability guarantee means every write persists to queue on disk before returning success. No fast path for empty queue.
  • S3 throttling logged at DEBUG: Throttling (503) is expected operational condition handled automatically by rate limiter, not a problem requiring human intervention. Serialization errors remain ERROR, transient failures remain WARN.
  • Queue persists across restarts: Pending writes survive service restarts. Rate limiter state is ephemeral and rediscovers appropriate rate naturally during processing.

Monitoring

New metrics for alerting and graphing:

  • s3_write_queue_depth: Current pending writes (alert on sustained growth)
  • s3_write_backoff_ms: Current backoff delay (visualize rate limiting)
  • s3_write_attempts_total: Counter by result (success/throttle/error_*)
  • s3_write_duration_ms: End-to-end latency from enqueue to completion
  • s3_dlq_depth: Dead letter queue depth for corrupted writes
  • layer_cache_read_fallback: S3→PostgreSQL fallback rate

Grafana Dashboard

Complete reorganization of layer cache metrics dashboard (42 panels) into 5 workflow-based sections with improved readability and new operational metrics:

  1. OVERVIEW & HEALTH: System health scorecard with error rates, S3/retry queue depths (color-coded thresholds), write throughput balance, and cache resolution metrics
  2. READ PATH PERFORMANCE: Request latency, backend latency comparison, cache miss rates, S3→PostgreSQL fallback tracking
  3. WRITE PATH - S3 PERSISTENCE: Queue depth, backoff state, success rate, latency distribution, error breakdown by category, dead letter queue depth
  4. WRITE PATH - POSTGRES PERSISTENCE: Persister operations, persistence latency, retry queue depth and rates
  5. EVICTION OPERATIONS: Eviction rates, failures, latency, memory-only evictions

Dashboard improvements:

  • All p50/p95/p99 panels show exactly 3 data series for readability (using panel repeat and aggregation to eliminate visual noise)
  • Template variables use union queries to dynamically discover all backends (foyer, postgres, s3) and services (including read-only edda)
  • Added 8 new operational metrics panels (health scorecard, fallback tracking, error breakdown, DLQ depth, throughput balance, memory-only evictions, overall miss rate)
  • Removed 3 redundant panels (data available in other visualizations)

The graphs in the screenshot below were generated in my dev environment by sticking a simple HTTP proxy between the services and versitygw. The proxy limited ObjectPut to 5 requests/second globally and returned a 503 when requests exceeded that rate.

[Screenshot: layer cache S3 write queue depth and backoff graphs from the dev environment]

Migration Path

  • Queue files are created on first S3 write: {cache_name}_s3_queue/{ULID}.pending
  • Dead letter queue: {cache_name}_s3_queue/dead_letter/{ULID}.corrupted

Services using PostgresOnly mode are unaffected. Services using S3-enabled modes (DualWrite, S3Primary, S3Only) will create queue directories on startup.

Development Environment

Updated the versitygw binary path in docker-compose from /usr/local/bin/versitygw to /app/versitygw to match the current container image (the container was exiting with code 127 after init-buckets.sh completed).

@github-actions github-actions bot added A-si-settings Area: Backend service settings management [Rust] A-sdf Area: Primary backend API service [Rust] A-veritech Area: Task execution backend service [Rust] A-cyclone Area: Function execution engine [Rust] A-dal A-bytes-lines-codec A-config-file A-si-test-macros A-telemetry-rs A-dal-test A-si-data-nats labels Nov 26, 2025

github-actions bot commented Nov 26, 2025

Dependency Review

✅ No vulnerabilities or OpenSSF Scorecard issues found.

Scanned Files

None

@jhelwig jhelwig force-pushed the jhelwig/reduce-layer-cache-throttling-noise branch 6 times, most recently from 64c3f23 to d525e65 Compare November 30, 2025 20:56
@github-actions github-actions bot added the A-ci Area: CI configuration files and scripts label Nov 30, 2025

jhelwig commented Dec 1, 2025

/try

github-actions bot commented Dec 1, 2025

Okay, starting a try! I'll update this comment once it's running...
🚀 Try running here! 🚀

Implements adaptive rate limiting using a gradient descent algorithm to smoothly
converge on an optimal S3 request rate without the indefinite oscillation that
MIMD (multiplicative increase, multiplicative decrease) exhibits.

# Algorithm

On throttle: backoff increases by `learning_rate * adjustment_size_ms`
On success: after N consecutive successes, backoff decreases by the same formula

Learning rate adapts:
- Grows on repeated throttling (larger steps to escape throttle zone)
- Shrinks during recovery (smaller steps for precision near optimal)

# Convergence

Simulations show 15-25 iterations to converge with ±7.5% oscillation around
target rate, vs MIMD's indefinite ±33% oscillation.

# Configuration

All parameters configurable via RateLimitConfig:
- adjustment_size_ms: Base step size for adjustments
- initial_learning_rate: Starting learning rate when throttled
- min/max_learning_rate: Bounds for adaptation
- learning_rate_growth: Multiplier when throttled (>1.0)
- learning_rate_shrink: Multiplier when reducing (<1.0)
- successes_before_reduction: Validation threshold
- zeno_threshold_ms: Below this, backoff jumps to zero

Includes comprehensive validation to catch misconfigurations at load time.
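A load-time validation pass of that shape might look roughly like the following, continuing the hypothetical `RateLimitConfig` sketched in the PR description above (field names are assumptions, not the actual code):

```rust
impl RateLimitConfig {
    /// Reject configurations that would break convergence at load time, rather
    /// than after the limiter starts misbehaving in production.
    fn validate(&self) -> Result<(), String> {
        if self.learning_rate_growth <= 1.0 {
            return Err("learning_rate_growth must be > 1.0".into());
        }
        if !(0.0 < self.learning_rate_shrink && self.learning_rate_shrink < 1.0) {
            return Err("learning_rate_shrink must be in (0.0, 1.0)".into());
        }
        if self.min_learning_rate > self.max_learning_rate {
            return Err("min_learning_rate must be <= max_learning_rate".into());
        }
        if !(self.min_learning_rate..=self.max_learning_rate)
            .contains(&self.initial_learning_rate)
        {
            return Err("initial_learning_rate must be within [min, max]".into());
        }
        if self.successes_before_reduction == 0 {
            return Err("successes_before_reduction must be at least 1".into());
        }
        Ok(())
    }
}
```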
…ring

We cannot lose S3 writes under any circumstances - it would be catastrophic
data loss. Memory-only queues don't provide sufficient guarantees because the
AWS SDK keeps retry data in memory for minutes, and any crash during that
window loses the write permanently.

This commit implements disk-based persistence where writes are atomically
persisted before being acknowledged. We reuse the retry queue's Postcard
serialization format and ULID-based filenames for chronological ordering
because it's proven reliable in production.

Non-retryable errors (serialization failures from corrupted data) are moved to
a dead letter queue. This preserves evidence for debugging while preventing the
corrupted write from blocking queue processing indefinitely.

This design prepares for eventual PostgreSQL layer removal. S3Layer will
operate as the sole persistence layer with its own infrastructure rather than
depending on shared retry queue mechanisms.

Different S3 errors require different handling strategies. Throttling responses
(503) are expected operational conditions that should trigger backoff and be
logged at DEBUG level. Serialization errors indicate corrupted data that cannot
be retried and must go to the dead letter queue. Everything else represents
transient failures that should retry with backoff.

The key insight is that throttling isn't a problem - it's S3 telling us we're
writing too fast, and the rate limiter handles this automatically. Treating it
as WARN-level creates operational noise that obscures genuine problems.

We extract error classification to a module-level function so both S3Layer
(for read operations) and S3QueueProcessor (for writes) use consistent logic.
Classification uses HTTP status codes primarily, with fallback to error message
patterns for S3-compatible services that may not use standard status codes.
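A simplified, SDK-agnostic sketch of that classification logic; the real module-level function works from the AWS SDK error types, but the decision tree has the same shape (names here are illustrative):

```rust
#[derive(Debug, PartialEq)]
enum S3ErrorKind {
    /// 503 / SlowDown: expected, handled by the rate limiter, logged at DEBUG.
    Throttled,
    /// Corrupted/unserializable data: cannot be retried, goes to the DLQ, logged at ERROR.
    NonRetryable,
    /// Anything else: transient, retried with backoff, logged at WARN.
    Transient,
}

fn classify_s3_error(http_status: Option<u16>, message: &str) -> S3ErrorKind {
    // Prefer the HTTP status code when we have one.
    if http_status == Some(503) {
        return S3ErrorKind::Throttled;
    }
    // Fall back to message patterns for S3-compatible services that don't use
    // standard status codes.
    let lower = message.to_ascii_lowercase();
    if lower.contains("slowdown") || lower.contains("throttl") {
        S3ErrorKind::Throttled
    } else if lower.contains("serialization") || lower.contains("deserialize") {
        S3ErrorKind::NonRetryable
    } else {
        S3ErrorKind::Transient
    }
}
```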
@jhelwig jhelwig force-pushed the jhelwig/reduce-layer-cache-throttling-noise branch from 92b78aa to 10b462e Compare December 1, 2025 19:45

jhelwig commented Dec 1, 2025

/try

github-actions bot commented Dec 1, 2025

Okay, starting a try! I'll update this comment once it's running...
🚀 Try running here! 🚀

…ueue into S3Layer

The core architectural decision here is that ALL S3 writes go through the
persistent queue with no fast path. This might seem excessive when the queue
is empty and we're not rate-limited, but consider what happens if we write
immediately: we could receive a 503 response, and if the process crashes
before we persist the write to the queue, it's lost forever.

We accept the trade-off of every write hitting disk because durability is more
important than write performance. This becomes critical as we prepare to remove
PostgreSQL entirely - S3Layer must operate as a self-sufficient persistence
layer without relying on shared retry infrastructure.

The processor runs as a background task that dequeues the oldest writes (ULID
ordering preserves chronology), applies the current backoff delay, and writes
directly to S3. We use separate S3 clients with different retry configurations:
the S3Layer client has SDK retry enabled for resilient reads, while the
processor client has retry disabled because the queue itself provides
application-level retry. This prevents double-retry and timeout multiplication.

Keys are transformed at the API boundary before queueing. The processor doesn't
know about transformation strategies or key prefixes - it just uses whatever
key is in the queue. This architectural simplification eliminates processor
dependency on S3Layer's internal transformation logic.

On shutdown, we complete any in-flight write and exit immediately without
draining the queue. The queue persists across restarts, and the rate limiter
rediscovers appropriate limits naturally during processing. Keeping the rate
limiter state ephemeral simplifies the implementation without compromising
durability.

Moving throttling detection from WARN to DEBUG logs reduces noise, but
operators still need visibility into system behavior. Metrics provide this
visibility without cluttering logs - operators can graph backoff trends over
time, create alerts on sustained high backoff without false alarms, and
understand how queue depth correlates with backoff state.

The four panels show queue depth (detecting backpressure), backoff state
(visualizing rate limiting in action), success rate (tracking write outcomes
over time), and latency distribution (end-to-end timing from enqueue through
successful write). Together these give operators a complete picture of S3
write health without requiring log analysis.

The architecture shift from "write fast, handle failures reactively" to
"write at sustainable pace proactively" isn't obvious from code alone.
Operators need to understand how the system behaves, why we made these
trade-offs, and how to interpret what they see in production.

The README provides an operational guide: how to configure rate limiting for
different workloads, which metrics to monitor and when to alert, where queue
files live on disk, and what happens during startup and shutdown. This context
helps operators make informed decisions about tuning and troubleshooting.

Inline documentation explains the completed architecture after refactoring.
We use separate S3 clients because reads need SDK retry for resilience while
writes need it disabled to prevent double-retry. Keys are transformed before
queueing so the processor doesn't need to know about transformation strategies.
The processor uses keys directly from the queue as a deliberate architectural
simplification.

The versitygw binary moved from /usr/local/bin/versitygw to /app/versitygw
in the container image, but the docker-compose configuration wasn't updated.
This caused the container to exit with code 127 (command not found) after
the init-buckets.sh script completed.

Update the exec path to /app/versitygw to match the current container
image structure.

Complete overhaul of the layer cache metrics dashboard to improve
readability, add missing operational metrics, and organize panels by
workflow.

Template Variables:
- Add operation template for filtering persistence latency metrics
- Convert backend template from hardcoded to dynamic query discovering
  all backends (foyer, postgres, s3) automatically
- Expand service template to include read-only services (edda) using
  union query across multiple metrics

Panel Organization:
Restructured 42 panels into 5 logical sections matching operational
workflows:

1. OVERVIEW & HEALTH - System health scorecard with error rates, queue
   depths, resolution rates, and write throughput balance
2. READ PATH PERFORMANCE - Request latency, backend latency comparison,
   cache miss rates, fallback tracking
3. WRITE PATH - S3 PERSISTENCE - Queue depth, backoff state, write
   throughput, error breakdown, dead letter queue
4. WRITE PATH - POSTGRES PERSISTENCE - Persister operations,
   persistence latency, retry queue depth and rates
5. EVICTION OPERATIONS - Eviction rates, failures, latency, and
   memory-only evictions

Readability Improvements:
- All p50/p95/p99 panels now show exactly 3 data series using panel
  repeat and aggregation
- Panel 26 (Persistence Latency): Repeats by operation, aggregates
  other dimensions
- Panel 27 (Backend Read Latency): Aggregates across caches
- Panel 28 (Request Latency): Repeats by cache
- Panel 29 (Backend Hit Latency): Repeats by cache, shows all backends
  together, removed miss series

New Metrics (8 panels):
- System Health Scorecard: Overall error rate, S3 queue depth, retry
  queue depth with color-coded thresholds
- Fallback Tracking: S3→PostgreSQL fallback rate indicating data
  availability issues
- S3 Write Error Breakdown: Stacked timeseries by error category
- S3 Dead Letter Queue Depth: Corrupted write tracking
- Write Throughput Balance: Compares S3 vs PostgreSQL write rates for
  dual-write monitoring
- Memory-Only Evictions: Tracks Foyer memory evictions without disk
  persistence
- Overall Cache Miss Rate: Single aggregated panel showing complete
  misses per cache

Removed Redundant Panels (3):
- Retry Queue Success vs Failure Ratio (data visible in other panels)
- S3 Write Failures by Error Category (replaced by comprehensive
  breakdown)
- Persister Retry Operations (superseded by specific retry queue
  panels)

Technical Details:
- Overall cache miss uses layer_cache_request_latency_ms metric
  (verified in layer_cache.rs:347-351) tracking complete misses across
  all layers
- All backends explicitly report both hit and miss metrics (verified in
  layer_cache.rs:154-308)
- Template union queries ensure all backends and services appear
  dynamically
@jhelwig jhelwig force-pushed the jhelwig/reduce-layer-cache-throttling-noise branch from 10b462e to 808cf48 Compare December 1, 2025 19:58

@zacharyhamm zacharyhamm left a comment

This code looks pretty solid (although it's giving Claude). A few comments; will look again when this is moved out of draft. I'm curious: is it not possible to do S3 writes in parallel for better throughput? Although I suppose we're going to be doing lots of parallel writes for every workspace in the whole service, so maybe a single queue for each layer cache makes sense.

/// attempting the next operation. Returns `Duration::ZERO` when no delay
/// is needed.
pub fn current_delay(&self) -> Duration {
    Duration::from_millis(self.current_backoff_ms as u64)
}
Contributor

An f64 can represent values both larger and smaller than a u64 can hold (overflow/underflow). Do we care here? Not sure.

Contributor Author

Max u64 milliseconds would be 5.845421e+8 years, so we should be staying well away from any over-/under-flow risk due to the clamping elsewhere.


// Write atomically (write to temp, then rename)
let temp_path = self.queue_dir.join(format!("{ulid}.tmp"));
fs::write(&temp_path, &bytes)?;
Contributor

Do we care about fsync here? It will probably be a perf hit

Contributor Author

I did have an fsync here at one point while working on it, and I keep flipping back and forth on it. The writes come through pretty quickly on an active system, so doing an fsync on every write will add up very quickly. It feels like what we'd want is an fsync timer (make sure there is one at least every N seconds) instead of one on every individual write.

Contributor

Eh, we can do that just by tuning the kernel (there's a dirty page expire sysctl flag and a few other things). Linux by default will ensure pages older than 30 seconds are written to disk

Contributor

This really makes me wonder if an sqlite3 database isn't a better solution than raw disk files for managing the queues. We can think about it in the future. This likely means a hard crash will leave some number of corrupted entries.

Comment on lines +126 to +132
if let Err(dlq_err) = self.move_corrupted_file_to_dlq(&path, ulid) {
    tracing::error!(
        "Failed to move corrupted file {} to DLQ: {}",
        path.display(),
        dlq_err
    );
}
Contributor

If moving the corrupted file to the DLQ fails, does this mean we'll encounter it again next time and attempt to process it again? Not sure of a simple solution to avoid this here, but it seems worth calling out.

Contributor Author

If the move fails, then we've likely got bigger FS issues, as these should be on the same filesystem and the move is an fs::rename, so it should just be inode updates, not "actual" data writing. It should also leave the bad file in the original place (part of the "normal" queue), unless there are catastrophic FS issues.

#[cfg(test)]
mod tests {
use super::*;
impl Drop for S3Layer {
Contributor

I'm curious how doing the shutdown this way will interact with the graceful shutdown handlers. Everywhere else we block with task trackers and cancellation token. There's one for layer db already.

Is there a risk we won't finish in-progress write operations before the termination completes?

Contributor Author

There's always a risk, but it should be safe if that happens, since we don't remove the queue entry until after the write is successful. The possibly-not-completed write would be picked up on service restart, and attempted again.

jhelwig commented Dec 1, 2025

I'm curious: is it not possible to do s3 writes in parallel for better throughput? Although I suppose we're going to be doing lots of parallel writes for every workspace in the whole service so maybe a single queue for each layer cache makes sense

Next steps to increase throughput that I can think of:

  • Rate-limited queue per prefix (S3 scales up write speeds based on the unique key prefix)
  • Rate-limited parallel writes per queue

These would mainly come into play after S3 has scaled up to the point where we're no longer getting any 503s with the serial, single, per-cache queueing, though.

@jhelwig jhelwig marked this pull request as ready for review December 1, 2025 22:25

jhelwig commented Dec 1, 2025

/try

github-actions bot commented Dec 1, 2025

Okay, starting a try! I'll update this comment once it's running...
🚀 Try running here! 🚀

@zacharyhamm zacharyhamm left a comment

let's see how it works in the wild

@jhelwig jhelwig added this pull request to the merge queue Dec 1, 2025
@jhelwig jhelwig removed this pull request from the merge queue due to a manual request Dec 1, 2025