S3Layer queued writes #7855
base: main
Conversation
Dependency Review: ✅ No vulnerabilities or OpenSSF Scorecard issues found. Scanned files: none.
Implements adaptive rate limiting using a gradient-descent-style algorithm to smoothly converge on the optimal S3 request rate without the indefinite oscillation that MIMD (multiplicative increase, multiplicative decrease) exhibits.

# Algorithm
- On throttle: backoff increases by `learning_rate * adjustment_size_ms`
- On success: after N consecutive successes, backoff decreases by the same formula
- Learning rate adapts:
  - Grows on repeated throttling (larger steps to escape the throttle zone)
  - Shrinks during recovery (smaller steps for precision near the optimum)

# Convergence
Simulations show 15-25 iterations to converge with ±7.5% oscillation around the target rate, vs MIMD's indefinite ±33% oscillation.

# Configuration
All parameters are configurable via RateLimitConfig:
- adjustment_size_ms: Base step size for adjustments
- initial_learning_rate: Starting learning rate when throttled
- min/max_learning_rate: Bounds for adaptation
- learning_rate_growth: Multiplier when throttled (>1.0)
- learning_rate_shrink: Multiplier when reducing (<1.0)
- successes_before_reduction: Validation threshold
- zeno_threshold_ms: Below this, backoff jumps to zero

Includes comprehensive validation to catch misconfigurations at load time.
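For illustration, a minimal sketch of this adjustment rule. The struct layout, method names, and the `max_backoff_ms` clamp are assumptions for the sketch, not necessarily the PR's exact API:

```rust
// Hedged sketch of the adjustment logic described above; names are illustrative.
struct RateLimitConfig {
    adjustment_size_ms: f64,
    min_learning_rate: f64,
    max_learning_rate: f64,
    learning_rate_growth: f64, // > 1.0
    learning_rate_shrink: f64, // < 1.0
    successes_before_reduction: u32,
    zeno_threshold_ms: f64,
    max_backoff_ms: f64, // assumed clamp bound, not from the PR
}

struct RateLimiter {
    current_backoff_ms: f64,
    learning_rate: f64,
    consecutive_successes: u32,
    config: RateLimitConfig,
}

impl RateLimiter {
    fn record_throttle(&mut self) {
        self.consecutive_successes = 0;
        // Larger steps while we are still being throttled.
        self.learning_rate = (self.learning_rate * self.config.learning_rate_growth)
            .clamp(self.config.min_learning_rate, self.config.max_learning_rate);
        self.current_backoff_ms = (self.current_backoff_ms
            + self.learning_rate * self.config.adjustment_size_ms)
            .min(self.config.max_backoff_ms);
    }

    fn record_success(&mut self) {
        self.consecutive_successes += 1;
        if self.consecutive_successes < self.config.successes_before_reduction {
            return;
        }
        self.consecutive_successes = 0;
        // Smaller steps while probing back toward the optimal rate.
        self.learning_rate = (self.learning_rate * self.config.learning_rate_shrink)
            .clamp(self.config.min_learning_rate, self.config.max_learning_rate);
        self.current_backoff_ms -= self.learning_rate * self.config.adjustment_size_ms;
        // Zeno guard: once close enough to zero, snap the backoff to zero.
        if self.current_backoff_ms < self.config.zeno_threshold_ms {
            self.current_backoff_ms = 0.0;
        }
    }
}
```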
…ring

We cannot lose S3 writes under any circumstances - it would be catastrophic data loss. Memory-only queues don't provide sufficient guarantees because the AWS SDK keeps retry data in memory for minutes, and any crash during that window loses the write permanently. This commit implements disk-based persistence where writes are atomically persisted before being acknowledged.

We reuse the retry queue's Postcard serialization format and ULID-based filenames for chronological ordering because they are proven reliable in production.

Non-retryable errors (serialization failures from corrupted data) are moved to a dead letter queue. This preserves evidence for debugging while preventing the corrupted write from blocking queue processing indefinitely.

This design prepares for eventual PostgreSQL layer removal. S3Layer will operate as the sole persistence layer with its own infrastructure rather than depending on shared retry queue mechanisms.
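A minimal sketch of what the enqueue path could look like under these constraints. The type names and exact on-disk layout are illustrative assumptions; the real code reuses the existing retry queue format:

```rust
use std::{fs, path::PathBuf};

// Hedged sketch of the persist-before-acknowledge enqueue path.
struct S3WriteQueue {
    queue_dir: PathBuf,
}

#[derive(serde::Serialize, serde::Deserialize)]
struct QueuedWrite {
    key: String,
    value: Vec<u8>,
}

impl S3WriteQueue {
    fn enqueue(&self, write: &QueuedWrite) -> std::io::Result<PathBuf> {
        // Postcard gives a compact binary encoding of the entry.
        let bytes = postcard::to_allocvec(write)
            .map_err(|e| std::io::Error::new(std::io::ErrorKind::InvalidData, e.to_string()))?;

        // ULIDs sort lexicographically by creation time, so listing the
        // directory yields chronological (oldest-first) processing order.
        let ulid = ulid::Ulid::new();
        let temp_path = self.queue_dir.join(format!("{ulid}.tmp"));
        let final_path = self.queue_dir.join(format!("{ulid}.pending"));

        // Write atomically: write to a temp file, then rename into place, so a
        // crash never leaves a partially written `.pending` entry.
        fs::write(&temp_path, &bytes)?;
        fs::rename(&temp_path, &final_path)?;
        Ok(final_path)
    }
}
```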
Different S3 errors require different handling strategies. Throttling responses (503) are expected operational conditions that should trigger backoff and be logged at DEBUG level. Serialization errors indicate corrupted data that cannot be retried and must go to the dead letter queue. Everything else represents transient failures that should retry with backoff.

The key insight is that throttling isn't a problem - it's S3 telling us we're writing too fast, and the rate limiter handles this automatically. Treating it as WARN-level creates operational noise that obscures genuine problems.

We extract error classification to a module-level function so both S3Layer (for read operations) and S3QueueProcessor (for writes) use consistent logic. Classification uses HTTP status codes primarily, with fallback to error message patterns for S3-compatible services that may not use standard status codes.
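A hedged sketch of that classification, written against a plain status code and message rather than the SDK's concrete error types; the enum and message patterns here are illustrative, not the PR's exact code:

```rust
// Hedged sketch of the error classification described above.
#[derive(Debug, PartialEq)]
enum WriteErrorKind {
    /// 503 / slow-down: expected backpressure, handled by the rate limiter (DEBUG).
    Throttled,
    /// Corrupted or unserializable data: moved to the dead letter queue (ERROR).
    NonRetryable,
    /// Anything else: transient, retried with backoff (WARN).
    Retryable,
}

fn classify_s3_error(status: Option<u16>, message: &str) -> WriteErrorKind {
    // Prefer the HTTP status code when one is available.
    if status == Some(503) {
        return WriteErrorKind::Throttled;
    }
    // Fall back to message patterns for S3-compatible services that may not
    // surface standard status codes.
    let lowered = message.to_ascii_lowercase();
    if lowered.contains("slowdown") || lowered.contains("slow down") {
        return WriteErrorKind::Throttled;
    }
    if lowered.contains("serializ") {
        return WriteErrorKind::NonRetryable;
    }
    WriteErrorKind::Retryable
}
```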
…ueue into S3Layer

The core architectural decision here is that ALL S3 writes go through the persistent queue with no fast path. This might seem excessive when the queue is empty and we're not rate-limited, but consider what happens if we write immediately: we could receive a 503 response, and if the process crashes before we persist the write to the queue, it's lost forever. We accept the trade-off of every write hitting disk because durability is more important than write performance. This becomes critical as we prepare to remove PostgreSQL entirely - S3Layer must operate as a self-sufficient persistence layer without relying on shared retry infrastructure.

The processor runs as a background task that dequeues the oldest writes (ULID ordering preserves chronology), applies the current backoff delay, and writes directly to S3. We use separate S3 clients with different retry configurations: the S3Layer client has SDK retry enabled for resilient reads, while the processor client has retry disabled because the queue itself provides application-level retry. This prevents double-retry and timeout multiplication.

Keys are transformed at the API boundary before queueing. The processor doesn't know about transformation strategies or key prefixes - it just uses whatever key is in the queue. This architectural simplification eliminates processor dependency on S3Layer's internal transformation logic.

On shutdown, we complete any in-flight write and exit immediately without draining the queue. The queue persists across restarts, and the rate limiter rediscovers appropriate limits naturally during processing. Keeping the rate limiter state ephemeral simplifies the implementation without compromising durability.
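To make the flow concrete, a simplified sketch of the processing loop, building on the RateLimiter and S3WriteQueue sketches above. `dequeue_oldest`, `remove`, `put_object`, and `is_throttle` are hypothetical helpers standing in for the real implementations, not the PR's actual API:

```rust
// Hedged sketch of the background processor loop described above.
async fn run_processor(
    queue: S3WriteQueue,
    mut rate_limiter: RateLimiter,
    mut shutdown: tokio::sync::watch::Receiver<bool>,
) {
    loop {
        // Apply the current backoff before touching S3 at all.
        let delay = rate_limiter.current_delay();
        if !delay.is_zero() {
            tokio::time::sleep(delay).await;
        }

        // Oldest entry first: ULID filenames sort chronologically.
        let entry = match queue.dequeue_oldest() {
            Some(entry) => entry,
            None => {
                // Idle: wait briefly for new work, or exit on shutdown.
                tokio::select! {
                    _ = tokio::time::sleep(std::time::Duration::from_millis(250)) => {}
                    _ = shutdown.changed() => return,
                }
                continue;
            }
        };

        match put_object(&entry.key, &entry.value).await {
            Ok(()) => {
                rate_limiter.record_success();
                // Only after a successful write is it safe to drop the durable copy.
                queue.remove(&entry);
            }
            Err(e) if e.is_throttle() => rate_limiter.record_throttle(),
            Err(_) => {
                // Transient failure: leave the entry in place; it is retried on a
                // later pass with backoff applied.
            }
        }

        // Finish the in-flight write above, then exit without draining the queue;
        // the queue itself persists across restarts.
        if *shutdown.borrow() {
            return;
        }
    }
}
```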
Moving throttling detection from WARN to DEBUG logs reduces noise, but operators still need visibility into system behavior. Metrics provide this visibility without cluttering logs - operators can graph backoff trends over time, create alerts on sustained high backoff without false alarms, and understand how queue depth correlates with backoff state.

The four panels show queue depth (detecting backpressure), backoff state (visualizing rate limiting in action), success rate (tracking write outcomes over time), and latency distribution (end-to-end timing from enqueue through successful write). Together these give operators a complete picture of S3 write health without requiring log analysis.
The architecture shift from "write fast, handle failures reactively" to "write at sustainable pace proactively" isn't obvious from code alone. Operators need to understand how the system behaves, why we made these trade-offs, and how to interpret what they see in production.

The README provides an operational guide: how to configure rate limiting for different workloads, which metrics to monitor and when to alert, where queue files live on disk, and what happens during startup and shutdown. This context helps operators make informed decisions about tuning and troubleshooting.

Inline documentation explains the completed architecture after refactoring. We use separate S3 clients because reads need SDK retry for resilience while writes need it disabled to prevent double-retry. Keys are transformed before queueing so the processor doesn't need to know about transformation strategies. The processor uses keys directly from the queue as a deliberate architectural simplification.
The versitygw binary moved from /usr/local/bin/versitygw to /app/versitygw in the container image, but the docker-compose configuration wasn't updated. This caused the container to exit with code 127 (command not found) after the init-buckets.sh script completed. Update the exec path to /app/versitygw to match the current container image structure.
Complete overhaul of the layer cache metrics dashboard to improve readability, add missing operational metrics, and organize panels by workflow.

Template Variables:
- Add operation template for filtering persistence latency metrics
- Convert backend template from hardcoded to a dynamic query discovering all backends (foyer, postgres, s3) automatically
- Expand service template to include read-only services (edda) using a union query across multiple metrics

Panel Organization: Restructured 42 panels into 5 logical sections matching operational workflows:
1. OVERVIEW & HEALTH - System health scorecard with error rates, queue depths, resolution rates, and write throughput balance
2. READ PATH PERFORMANCE - Request latency, backend latency comparison, cache miss rates, fallback tracking
3. WRITE PATH - S3 PERSISTENCE - Queue depth, backoff state, write throughput, error breakdown, dead letter queue
4. WRITE PATH - POSTGRES PERSISTENCE - Persister operations, persistence latency, retry queue depth and rates
5. EVICTION OPERATIONS - Eviction rates, failures, latency, and memory-only evictions

Readability Improvements:
- All p50/p95/p99 panels now show exactly 3 data series using panel repeat and aggregation
- Panel 26 (Persistence Latency): Repeats by operation, aggregates other dimensions
- Panel 27 (Backend Read Latency): Aggregates across caches
- Panel 28 (Request Latency): Repeats by cache
- Panel 29 (Backend Hit Latency): Repeats by cache, shows all backends together, removed miss series

New Metrics (8 panels):
- System Health Scorecard: Overall error rate, S3 queue depth, retry queue depth with color-coded thresholds
- Fallback Tracking: S3→PostgreSQL fallback rate indicating data availability issues
- S3 Write Error Breakdown: Stacked timeseries by error category
- S3 Dead Letter Queue Depth: Corrupted write tracking
- Write Throughput Balance: Compares S3 vs PostgreSQL write rates for dual-write monitoring
- Memory-Only Evictions: Tracks Foyer memory evictions without disk persistence
- Overall Cache Miss Rate: Single aggregated panel showing complete misses per cache

Removed Redundant Panels (3):
- Retry Queue Success vs Failure Ratio (data visible in other panels)
- S3 Write Failures by Error Category (replaced by comprehensive breakdown)
- Persister Retry Operations (superseded by specific retry queue panels)

Technical Details:
- Overall cache miss uses the layer_cache_request_latency_ms metric (verified in layer_cache.rs:347-351), tracking complete misses across all layers
- All backends explicitly report both hit and miss metrics (verified in layer_cache.rs:154-308)
- Template union queries ensure all backends and services appear dynamically
zacharyhamm left a comment
This code looks pretty solid (although it's giving claude). A few comments. Will look again when moved out of draft. I'm curious: is it not possible to do S3 writes in parallel for better throughput? Although I suppose we're going to be doing lots of parallel writes for every workspace in the whole service, so maybe a single queue for each layer cache makes sense.
/// attempting the next operation. Returns `Duration::ZERO` when no delay
/// is needed.
pub fn current_delay(&self) -> Duration {
    Duration::from_millis(self.current_backoff_ms as u64)
f64 can be both larger and smaller than u64 (overflow/underflow). Do we care here? not sure
Max u64 milliseconds would be 5.845421e+8 years, so we should be staying well away from any over-/under-flow risk due to the clamping elsewhere.
// Write atomically (write to temp, then rename)
let temp_path = self.queue_dir.join(format!("{ulid}.tmp"));
fs::write(&temp_path, &bytes)?;
Do we care about fsync here? It will probably be a perf hit
I did have fsync here at one point while working on it, and I keep flipping back and forth on it. The writes come through pretty quickly on an active system, so doing an fsync on every write will add up very quickly. It feels like what we'd want is an fsync timer (make sure there is one at least every N seconds), instead of on every individual write.
Eh, we can do that just by tuning the kernel (there's a dirty page expire sysctl flag and a few other things). Linux by default will ensure pages older than 30 seconds are written to disk
this really makes me wonder if an sqlite3 database isn't a better solution than raw disk files for managing the queues. we can think about it in the future. this likely means a hard crash will leave some number of corrupted entries
if let Err(dlq_err) = self.move_corrupted_file_to_dlq(&path, ulid) {
    tracing::error!(
        "Failed to move corrupted file {} to DLQ: {}",
        path.display(),
        dlq_err
    );
}
If moving the corrupted file to the DLQ fails, does this mean we'll encounter it again next time and attempt to process it again? not sure of a simple solution to avoid this here but seems worth calling out
If the move fails, then we've likely got bigger FS issues, as these should be on the same filesystem, and the move is an fs::rename, so it should just be inode updates, not "actual" data writing. It should also leave the bad file in the original place (part of the "normal" queue), unless there are catastrophic FS issues.
#[cfg(test)]
mod tests {
    use super::*;
impl Drop for S3Layer {
I'm curious how doing the shutdown this way will interact with the graceful shutdown handlers. Everywhere else we block with task trackers and cancellation token. There's one for layer db already.
Is there a risk we won't finish in-progress write operations before the termination completes?
There's always a risk, but it should be safe if that happens, since we don't remove the queue entry until after the write is successful. The possibly-not-completed write would be picked up on service restart, and attempted again.
Next steps to increase throughput that I can think of:
These mainly would come into play after S3 has scaled up to the point where we're no longer getting any 503s in the serial, single, per-cache queueing, though.
zacharyhamm left a comment
let's see how it works in the wild
Problem
S3 throttling happens frequently in production, generating three log messages per throttle event: "dual write failed", "enqueued for retry", and "retry successful". This operational noise makes it difficult to spot genuine problems in production logs and prevents creating useful retry queue alerts without constant false alarms. The log volume also pushes us into Honeycomb rate limiting and adds noise there.
The current system writes to S3 as fast as possible and handles throttling reactively. The dual-write implementation uses `tokio::join!` to write to both PostgreSQL and S3 in parallel. When S3 throttling occurs, the native SDK retry logic blocks the entire write operation, making PostgreSQL writes wait for S3's exponential backoff to complete. Since S3 is almost always the slower backend during capacity ramp-up, this significantly degrades layer cache write performance.

Impact of Current State
Solution
This PR shifts from "write as fast as possible, handle failures reactively" to "write at sustainable pace with adaptive rate limiting." It introduces three new components:
S3WriteQueue: Persistent disk-based queue using Postcard serialization and ULID ordering for chronological processing. All S3 writes are atomically persisted to disk before acknowledgment, ensuring no data loss even during crashes. Non-retryable errors (serialization failures from corrupted data) are moved to a dead letter queue to preserve evidence while preventing corrupted writes from blocking queue processing.
RateLimiter: Adaptive rate limiting that adjusts delay between S3 writes based on throttling responses. Increases delay on throttling (503), decreases delay after consecutive successes. Uses learning rate adaptation - growing on repeated throttling (larger adjustment steps to reduce request rate quickly), shrinking during recovery (smaller steps to validate stability at each rate level before ramping up).
S3QueueProcessor: Background task that dequeues oldest writes, applies backoff delays, and writes to S3. Separate S3 clients with different retry configurations: reads use SDK retry for resilience, writes use disabled retry because the queue provides application-level retry (prevents double-retry and timeout multiplication).
After this change:
Design Rationale
Why background queue vs fixing SDK retry blocking?
The SDK retry blocking is fundamental to how tokio::join! works - it waits for all futures to complete. We could disable SDK retry entirely, but then we lose retry behavior for reads and need to handle all transient S3 errors ourselves. The queue-based approach gives us both: SDK retry for resilient reads, application-level retry for writes without blocking PostgreSQL.
Why are all writes persisted to disk (even when queue is empty)?
An immediate write could receive 503 from S3 even when the queue is empty. If the process crashes between receiving the 503 and persisting to the queue, the write is lost permanently with no recovery path. Disk persistence before acknowledgment ensures zero data loss, even during crashes. The performance overhead of hitting disk is acceptable because durability guarantees are non-negotiable for cache persistence.
Why adaptive learning rate approach?
Standard exponential backoff uses fixed multipliers (e.g., multiply by M on throttle, divide by N on success) which causes indefinite oscillation around the throttle threshold. Each recovery overshoots and triggers throttling again because the adjustment size never changes.
The adaptive approach adjusts the step size based on system response: growing when repeatedly throttled (larger steps needed to escape throttle zone), shrinking after successful reductions (smaller steps to find sustainable rate precisely). This reduces repeated throttle events during recovery by validating stability at each rate level before making the next adjustment.
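As a hedged numeric illustration (values invented for this example, not the shipped defaults): with adjustment_size_ms = 100 ms and a growth factor of 1.5, three consecutive throttles grow the learning rate 1.0 → 1.5 → 2.25, so the backoff steps climb from roughly 100 ms toward roughly 225 ms and the write rate drops quickly out of the throttle zone. During recovery, each batch of successes_before_reduction successful writes shrinks the learning rate (say by ×0.5), so every downward step is roughly half the size of the previous one; the backoff walks back toward zero in ever-finer increments until it crosses zeno_threshold_ms and snaps to zero, instead of overshooting and re-triggering throttling the way a fixed multiplier does.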
Why separate S3 clients with different retry configs?
S3Layer reads need SDK retry for resilience. Processor writes need it disabled to prevent double-retry and timeout multiplication (the queue provides application-level retry). Using the same client would force choosing one behavior over the other.
Why transform keys before queueing?
Processor doesn't need to know about transformation strategies or key prefixes. It just uses whatever key is in the queue. This architectural simplification eliminates processor dependency on S3Layer's internal transformation logic.
Why ephemeral rate limiter state?
Simpler implementation, no persistence overhead. Rate limiter rediscovers appropriate limits naturally during processing after restart. Queue persistence ensures durability; rate limiter just optimizes request timing.
Why DEBUG logging for throttling vs WARN?
Different S3 errors require different handling strategies. Throttling responses (503) are expected operational conditions that the rate limiter handles automatically - they're S3 telling us we're writing too fast, not a problem requiring human intervention. Logging at WARN creates operational noise that obscures genuine problems. Serialization errors indicate corrupted data and go to ERROR (requires investigation). Everything else represents transient failures logged at WARN with backoff retry.
Deployment Notes
Behavioral Changes
Monitoring
New metrics for alerting and graphing:
- `s3_write_queue_depth`: Current pending writes (alert on sustained growth)
- `s3_write_backoff_ms`: Current backoff delay (visualize rate limiting)
- `s3_write_attempts_total`: Counter by result (success/throttle/error_*)
- `s3_write_duration_ms`: End-to-end latency from enqueue to completion
- `s3_dlq_depth`: Dead letter queue depth for corrupted writes
- `layer_cache_read_fallback`: S3→PostgreSQL fallback rate

Grafana Dashboard
Complete reorganization of layer cache metrics dashboard (42 panels) into 5 workflow-based sections with improved readability and new operational metrics:
Dashboard improvements:
The graphs in the screenshot below were generated in my dev environment by sticking a simple HTTP proxy between the services and versitygw. The proxy would limit ObjectPut to 5 requests/second globally, and return a 503 when requests exceed that rate.
Migration Path
- `{cache_name}_s3_queue/{ULID}.pending`
- `{cache_name}_s3_queue/dead_letter/{ULID}.corrupted`

Services using `PostgresOnly` mode are unaffected. Services using S3-enabled modes (`DualWrite`, `S3Primary`, `S3Only`) will create queue directories on startup.

Development Environment
Updated the versitygw binary path in docker-compose from `/usr/local/bin/versitygw` to `/app/versitygw` to match the current container image (the container was exiting with code 127 after init-buckets.sh completed).