Description
I've done some research on observability practices for large BEAM production systems, and it surfaced a lot of interesting stories about production issues. I think it's worth a read: https://gist.github.com/balegas/ca6047972f5cecfa74009404643895a8
Problem
Electric’s sync engine relies on the BEAM virtual machine to manage large numbers of lightweight processes efficiently.
While we already export application-level metrics and distributed traces to Honeycomb via OpenTelemetry, we currently lack direct visibility into BEAM runtime behavior. This makes it difficult to determine whether performance issues such as latency spikes, throughput drops, or increased resource usage originate from application logic or from underlying VM saturation.
In particular, we have limited insight into:
- Scheduler and run queue contention.
- Memory distribution across processes, ETS tables, and binaries.
- Runaway process or mailbox growth over time.
Without these metrics, we cannot accurately assess system efficiency, detect early signs of saturation, or evaluate the resource impact of architectural decisions (e.g., process-per-tenant patterns, fanout strategies, or memory allocation behavior).
Expected Gains
Improved runtime observability will enable us to:
- Detect scheduler contention and saturation before it impacts performance.
- Identify memory leaks, fragmentation, and inefficient allocation patterns.
- Track process and ETS growth to prevent runaway resource usage.
- Quantify the effect of design decisions on BEAM resource efficiency.
- Establish meaningful alert thresholds and capacity planning baselines.
Metrics Roadmap
All metrics will be emitted periodically via Telemetry and exported to Honeycomb using the existing OpenTelemetry pipeline.
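To make the emission side concrete, here is a minimal sketch using `telemetry_poller` to run sampling functions on a schedule; the module name `Electric.BeamMetrics`, the event names, and the poller name are placeholders rather than existing code. (I believe `telemetry_poller` also emits built-in `[:vm, :memory]`, `[:vm, :total_run_queue_lengths]`, and `[:vm, :system_counts]` events out of the box, which would already cover a few rows of the table below.)

```elixir
# Sketch only: sampling functions plus a telemetry_poller child that runs them
# every 15s. Module, event, and poller names are placeholders.
defmodule Electric.BeamMetrics do
  # Emits [:beam, :memory, :by_type] with one measurement per erlang:memory/0 category.
  def emit_memory do
    :telemetry.execute([:beam, :memory, :by_type], Map.new(:erlang.memory()), %{})
  end

  # Emits [:beam, :process, :count] with the current process count.
  def emit_process_count do
    :telemetry.execute(
      [:beam, :process, :count],
      %{count: :erlang.system_info(:process_count)},
      %{}
    )
  end
end

# In the application's supervision tree:
children = [
  {:telemetry_poller,
   measurements: [
     {Electric.BeamMetrics, :emit_memory, []},
     {Electric.BeamMetrics, :emit_process_count, []}
   ],
   period: :timer.seconds(15),
   name: :beam_metrics_poller}
]
```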
| Rank | Metric | Purpose | Source | Suggested Interval |
|---|---|---|---|---|
| 1 | beam.scheduler.utilization | Detect scheduler contention and core saturation | erlang:statistics(scheduler_wall_time) | 15s |
| 2 | beam.run_queue.length | Measure runnable process backlog | erlang:statistics(run_queue_lengths) | 15s |
| 3 | beam.memory.by_type | Observe memory usage by category (processes, binaries, ETS, atoms) | erlang:memory/0 | 15s |
| 4 | beam.process.count | Track process growth and supervision load | erlang:system_info(process_count) | 30s |
| 5 | beam.gc.pause_ms | Measure garbage collection frequency and pause times | erlang:statistics(garbage_collection) | 30s |
| 6 | beam.process.mailbox_topN | Identify processes with large message queues | recon:proc_count(message_queue_len, 10) | 30s |
| 7 | beam.ets.memory_bytes_total / beam.ets.table_size | Detect ETS memory growth or leaks | erlang:memory(ets), ets:info/2 | 30s |
| 8 | beam.memory.binary_bytes | Monitor large binary usage and reference leaks | erlang:memory(binary) | 30s |
| 9 | beam.gc.reductions_per_sec | Approximate workload throughput and CPU pressure | erlang:statistics(reductions) | 15s |
| 10 | beam.lock_contention (OTP 27+) | Detect global lock contention (e.g., timer wheel locks) | erlang:statistics(scheduler_wall_time) deltas | 30s |
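Row 1 needs a bit of extra care: `scheduler_wall_time` accounting is off by default and its counters are cumulative, so utilization has to be computed as a delta between two samples. A minimal sketch, assuming something like a small GenServer keeps the previous sample between ticks (module and event names are again placeholders):

```elixir
# Sketch only: per-interval scheduler utilization from scheduler_wall_time deltas.
# enable/0 must run once before sampling, otherwise the statistics call
# returns :undefined.
defmodule Electric.BeamMetrics.Scheduler do
  def enable, do: :erlang.system_flag(:scheduler_wall_time, true)

  # Current cumulative sample, sorted by scheduler id so deltas line up.
  def sample, do: Enum.sort(:erlang.statistics(:scheduler_wall_time))

  # Per-scheduler utilization between two samples: active time / total time.
  def utilization(prev, next) do
    prev
    |> Enum.zip(next)
    |> Enum.map(fn {{id, active0, total0}, {id, active1, total1}} ->
      {id, (active1 - active0) / max(total1 - total0, 1)}
    end)
  end

  # Emits [:beam, :scheduler, :utilization] with the average across schedulers.
  def emit(prev, next) do
    per_scheduler = utilization(prev, next)
    avg = Enum.sum(Enum.map(per_scheduler, &elem(&1, 1))) / length(per_scheduler)

    :telemetry.execute(
      [:beam, :scheduler, :utilization],
      %{average: avg},
      %{per_scheduler: per_scheduler}
    )
  end
end
```

Row 9 can follow the same delta pattern, although `erlang:statistics(reductions)` already returns `{Total, SinceLastCall}`, so the second element is effectively the per-interval value.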
Acceptance Criteria
- Periodic Telemetry events are emitted for all selected metrics.
- Metrics are exported to Honeycomb via OpenTelemetry and visible in dashboards.
- Baseline thresholds are documented for scheduler utilization, process count, and memory growth.
- An internal observability.md document describes metric meaning, collection frequency, and interpretation.
- Alert rules are established for early detection of contention and resource exhaustion (one possible local wiring is sketched below).
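For the last two criteria, one possible starting point is a Telemetry handler that checks the emitted measurements against the documented baselines; the threshold value and handler below are purely illustrative, and in practice the alert rules would live in Honeycomb on top of the exported metrics.

```elixir
# Purely illustrative: a handler that flags when a sampled value crosses a
# placeholder threshold. Real alerting is expected to live in Honeycomb.
defmodule Electric.BeamMetrics.Alerts do
  require Logger

  # Hypothetical baseline; the real value comes out of the documented thresholds.
  @process_count_threshold 50_000

  def attach do
    :telemetry.attach(
      "beam-process-count-alert",
      [:beam, :process, :count],
      &__MODULE__.handle_event/4,
      nil
    )
  end

  # Logs a warning when the sampled process count crosses the baseline.
  def handle_event(_event, %{count: count}, _meta, _config)
      when count > @process_count_threshold do
    Logger.warning("BEAM process count #{count} exceeds baseline #{@process_count_threshold}")
  end

  def handle_event(_event, _measurements, _meta, _config), do: :ok
end
```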