
Improve BEAM Runtime Observability for Electric #3237

@balegas

Description


I've done some research on observability practices for large BEAM production systems, and it surfaced a lot of interesting stories about production issues. I think it's worth a read: https://gist.github.com/balegas/ca6047972f5cecfa74009404643895a8

Problem

Electric’s sync engine relies on the BEAM virtual machine to manage large numbers of lightweight processes efficiently.
While we already export application-level metrics and distributed traces to Honeycomb via OpenTelemetry, we currently lack direct visibility into BEAM runtime behavior. This makes it difficult to determine whether performance issues such as latency spikes, throughput drops, or increased resource usage originate from application logic or from underlying VM saturation.

In particular, we have limited insight into:

  • Scheduler and run queue contention.
  • Memory distribution across processes, ETS tables, and binaries.
  • Runaway process or mailbox growth over time.

Without these metrics, we cannot accurately assess system efficiency, detect early signs of saturation, or evaluate the resource impact of architectural decisions (e.g., process-per-tenant patterns, fanout strategies, or memory allocation behavior).

Expected Gains

Improved runtime observability will enable us to:

  • Detect scheduler contention and saturation before it impacts performance.
  • Identify memory leaks, fragmentation, and inefficient allocation patterns.
  • Track process and ETS growth to prevent runaway resource usage.
  • Quantify the effect of design decisions on BEAM resource efficiency.
  • Establish meaningful alert thresholds and capacity planning baselines.

Metrics Roadmap

All metrics will be emitted periodically via Telemetry and exported to Honeycomb through the existing OpenTelemetry pipeline; a minimal collection sketch follows the table below.

| Rank | Metric | Purpose | Source | Suggested Interval |
|------|--------|---------|--------|--------------------|
| 1 | beam.scheduler.utilization | Detect scheduler contention and core saturation | erlang:statistics(scheduler_wall_time) | 15s |
| 2 | beam.run_queue.length | Measure runnable process backlog | erlang:statistics(run_queue_lengths) | 15s |
| 3 | beam.memory.by_type | Observe memory usage by category (processes, binaries, ETS, atoms) | erlang:memory/0 | 15s |
| 4 | beam.process.count | Track process growth and supervision load | erlang:system_info(process_count) | 30s |
| 5 | beam.gc.pause_ms | Measure garbage collection frequency and pause times | erlang:statistics(garbage_collection) | 30s |
| 6 | beam.process.mailbox_topN | Identify processes with large message queues | recon:proc_count(message_queue_len, 10) | 30s |
| 7 | beam.ets.memory_bytes_total / beam.ets.table_size | Detect ETS memory growth or leaks | erlang:memory(ets), ets:info/2 | 30s |
| 8 | beam.memory.binary_bytes | Monitor large binary usage and reference leaks | erlang:memory(binary) | 30s |
| 9 | beam.gc.reductions_per_sec | Approximate workload throughput and CPU pressure | erlang:statistics(reductions) | 15s |
| 10 | beam.lock_contention (OTP 27+) | Detect global lock contention (e.g., timer wheel locks) | erlang:statistics(scheduler_wall_time) deltas | 30s |
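As a starting point, here is a minimal sketch of a periodic collector for the first few metrics. It assumes we reuse :telemetry_poller and hang the functions off a hypothetical Electric.Telemetry.BeamMetrics module; the module, event, and measurement names are placeholders and should follow whatever conventions the existing pipeline already uses.

```elixir
defmodule Electric.Telemetry.BeamMetrics do
  @moduledoc """
  Periodic BEAM runtime measurements emitted as Telemetry events.
  Module and event names are placeholders for this sketch.
  """

  # scheduler_wall_time sampling must be enabled once (e.g. at application start)
  # before erlang:statistics(scheduler_wall_time) returns data.
  def enable_scheduler_wall_time, do: :erlang.system_flag(:scheduler_wall_time, true)

  # beam.memory.by_type: erlang:memory/0 returns byte counts per category.
  def emit_memory do
    :telemetry.execute([:beam, :memory], Map.new(:erlang.memory()), %{})
  end

  # beam.run_queue.length: per-scheduler runnable-process backlog.
  def emit_run_queues do
    lengths = :erlang.statistics(:run_queue_lengths)
    :telemetry.execute([:beam, :run_queue], %{total: Enum.sum(lengths), max: Enum.max(lengths)}, %{})
  end

  # beam.process.count
  def emit_process_count do
    :telemetry.execute([:beam, :process], %{count: :erlang.system_info(:process_count)}, %{})
  end

  # beam.scheduler.utilization: the ratio of active to total scheduler wall time
  # between two consecutive samples. The previous sample is kept in the process
  # dictionary, assuming the measurement runs in the poller's own process.
  def emit_scheduler_utilization do
    with sample when is_list(sample) <- :erlang.statistics(:scheduler_wall_time) do
      case Process.put(:beam_swt_sample, Enum.sort(sample)) do
        nil ->
          :ok

        prev ->
          {active, total} =
            prev
            |> Enum.zip(Enum.sort(sample))
            |> Enum.reduce({0, 0}, fn {{_id, a0, t0}, {_id2, a1, t1}}, {a, t} ->
              {a + (a1 - a0), t + (t1 - t0)}
            end)

          :telemetry.execute([:beam, :scheduler], %{utilization: active / max(total, 1)}, %{})
      end
    end
  end
end

# Wiring into the supervision tree, e.g. one poller per interval group:
#
#   {:telemetry_poller,
#    name: :electric_beam_metrics_15s,
#    period: :timer.seconds(15),
#    measurements: [
#      {Electric.Telemetry.BeamMetrics, :emit_scheduler_utilization, []},
#      {Electric.Telemetry.BeamMetrics, :emit_run_queues, []},
#      {Electric.Telemetry.BeamMetrics, :emit_memory, []}
#    ]}
```

A second poller with period: :timer.seconds(30) would cover the lower-frequency group (process count, GC stats, mailbox top-N via recon, ETS and binary memory).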

Acceptance Criteria

  • Periodic Telemetry events are emitted for all selected metrics.
  • Metrics are exported to Honeycomb via OpenTelemetry and visible in dashboards (see the metric-definition sketch after this list).
  • Baseline thresholds are documented for scheduler utilization, process count, and memory growth.
  • An internal observability.md document describes metric meaning, collection frequency, and interpretation.
  • Alert rules are established for early detection of contention and resource exhaustion.
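If the existing export path consumes Telemetry.Metrics definitions before handing data to the OpenTelemetry/Honeycomb exporter (an assumption; the exact reporter depends on how the current pipeline is wired), the events from the collector sketch above could be declared roughly like this:

```elixir
defmodule Electric.Telemetry.BeamMetricDefinitions do
  # Hypothetical module; in practice these entries would be appended to the
  # metrics list already passed to the existing reporter.
  import Telemetry.Metrics

  # Telemetry.Metrics splits each name into an event name plus a measurement key,
  # e.g. "beam.memory.processes" -> event [:beam, :memory], measurement :processes.
  def definitions do
    [
      last_value("beam.scheduler.utilization"),
      last_value("beam.run_queue.total"),
      last_value("beam.run_queue.max"),
      last_value("beam.memory.processes", unit: :byte),
      last_value("beam.memory.binary", unit: :byte),
      last_value("beam.memory.ets", unit: :byte),
      last_value("beam.memory.atom", unit: :byte),
      last_value("beam.process.count")
    ]
  end
end
```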
