Description
I've done some research on observability practices for large BEAM production systems, and it surfaced a lot of interesting stories about production issues. I think it's worth a read: https://gist.github.com/balegas/ca6047972f5cecfa74009404643895a8
Problem
Electric’s sync engine relies on the BEAM virtual machine to manage large numbers of lightweight processes efficiently.
While we already export application-level metrics and distributed traces to Honeycomb via OpenTelemetry, we currently lack direct visibility into BEAM runtime behavior. This makes it difficult to determine whether performance issues such as latency spikes, throughput drops, or increased resource usage originate from application logic or from underlying VM saturation.
In particular, we have limited insight into:
- Scheduler and run queue contention.
- Memory distribution across processes, ETS tables, and binaries.
- Runaway process or mailbox growth over time.
Without these metrics, we cannot accurately assess system efficiency, detect early signs of saturation, or evaluate the resource impact of architectural decisions (e.g., process-per-tenant patterns, fanout strategies, or memory allocation behavior).
Expected Gains
Improved runtime observability will enable us to:
- Detect scheduler contention and saturation before it impacts performance.
- Identify memory leaks, fragmentation, and inefficient allocation patterns.
- Track process and ETS growth to prevent runaway resource usage.
- Quantify the effect of design decisions on BEAM resource efficiency.
- Establish meaningful alert thresholds and capacity planning baselines.
Metrics Roadmap
All metrics will be emitted periodically via Telemetry and exported to Honeycomb using the existing OpenTelemetry pipeline.
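To make the emission side concrete, here is a minimal sketch using `telemetry_poller` to run sampling functions on a schedule; the module name `Electric.BeamMetrics`, the event names, and the poller name are placeholders rather than existing code. (I believe `telemetry_poller` also emits built-in `[:vm, :memory]`, `[:vm, :total_run_queue_lengths]`, and `[:vm, :system_counts]` events out of the box, which would already cover a few rows of the table below.)

```elixir
# Sketch only: sampling functions plus a telemetry_poller child that runs them
# every 15s. Module, event, and poller names are placeholders.
defmodule Electric.BeamMetrics do
  # Emits [:beam, :memory, :by_type] with one measurement per erlang:memory/0 category.
  def emit_memory do
    :telemetry.execute([:beam, :memory, :by_type], Map.new(:erlang.memory()), %{})
  end

  # Emits [:beam, :process, :count] with the current process count.
  def emit_process_count do
    :telemetry.execute(
      [:beam, :process, :count],
      %{count: :erlang.system_info(:process_count)},
      %{}
    )
  end
end

# In the application's supervision tree:
children = [
  {:telemetry_poller,
   measurements: [
     {Electric.BeamMetrics, :emit_memory, []},
     {Electric.BeamMetrics, :emit_process_count, []}
   ],
   period: :timer.seconds(15),
   name: :beam_metrics_poller}
]
```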
| Rank | Metric | Purpose | Source | Suggested Interval |
|---|---|---|---|---|
| 1 | beam.scheduler.utilization | Detect scheduler contention and core saturation | erlang:statistics(scheduler_wall_time) | 15s |
| 2 | beam.run_queue.length | Measure runnable process backlog | erlang:statistics(run_queue_lengths) | 15s |
| 3 | beam.memory.by_type | Observe memory usage by category (processes, binaries, ETS, atoms) | erlang:memory/0 | 15s |
| 4 | beam.process.count | Track process growth and supervision load | erlang:system_info(process_count) | 30s |
| 5 | beam.gc.pause_ms | Measure garbage collection frequency and pause times | erlang:statistics(garbage_collection) | 30s |
| 6 | beam.process.mailbox_topN | Identify processes with large message queues | recon:proc_count(message_queue_len, 10) | 30s |
| 7 | beam.ets.memory_bytes_total / beam.ets.table_size | Detect ETS memory growth or leaks | erlang:memory(ets), ets:info/2 | 30s |
| 8 | beam.memory.binary_bytes | Monitor large binary usage and reference leaks | erlang:memory(binary) | 30s |
| 9 | beam.gc.reductions_per_sec | Approximate workload throughput and CPU pressure | erlang:statistics(reductions) | 15s |
| 10 | beam.lock_contention (OTP 27+) | Detect global lock contention (e.g., timer wheel locks) | erlang:statistics(scheduler_wall_time) deltas | 30s |
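Row 1 needs a bit of extra care: `scheduler_wall_time` accounting is off by default and its counters are cumulative, so utilization has to be computed as a delta between two samples. A minimal sketch, assuming something like a small GenServer keeps the previous sample between ticks (module and event names are again placeholders):

```elixir
# Sketch only: per-interval scheduler utilization from scheduler_wall_time deltas.
# enable/0 must run once before sampling, otherwise the statistics call
# returns :undefined.
defmodule Electric.BeamMetrics.Scheduler do
  def enable, do: :erlang.system_flag(:scheduler_wall_time, true)

  # Current cumulative sample, sorted by scheduler id so deltas line up.
  def sample, do: Enum.sort(:erlang.statistics(:scheduler_wall_time))

  # Per-scheduler utilization between two samples: active time / total time.
  def utilization(prev, next) do
    prev
    |> Enum.zip(next)
    |> Enum.map(fn {{id, active0, total0}, {id, active1, total1}} ->
      {id, (active1 - active0) / max(total1 - total0, 1)}
    end)
  end

  # Emits [:beam, :scheduler, :utilization] with the average across schedulers.
  def emit(prev, next) do
    per_scheduler = utilization(prev, next)
    avg = Enum.sum(Enum.map(per_scheduler, &elem(&1, 1))) / length(per_scheduler)

    :telemetry.execute(
      [:beam, :scheduler, :utilization],
      %{average: avg},
      %{per_scheduler: per_scheduler}
    )
  end
end
```

Row 9 can follow the same delta pattern, although `erlang:statistics(reductions)` already returns `{Total, SinceLastCall}`, so the second element is effectively the per-interval value.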
Acceptance Criteria
- Periodic Telemetry events are emitted for all selected metrics.
- Metrics are exported to Honeycomb via OpenTelemetry and visible in dashboards.
- Baseline thresholds are documented for scheduler utilization, process count, and memory growth.
- An internal observability.md document describes metric meaning, collection frequency, and interpretation.
- Alert rules are established for early detection of contention and resource exhaustion (one possible local wiring is sketched below).
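For the last two criteria, one possible starting point is a Telemetry handler that checks the emitted measurements against the documented baselines; the threshold value and handler below are purely illustrative, and in practice the alert rules would live in Honeycomb on top of the exported metrics.

```elixir
# Purely illustrative: a handler that flags when a sampled value crosses a
# placeholder threshold. Real alerting is expected to live in Honeycomb.
defmodule Electric.BeamMetrics.Alerts do
  require Logger

  # Hypothetical baseline; the real value comes out of the documented thresholds.
  @process_count_threshold 50_000

  def attach do
    :telemetry.attach(
      "beam-process-count-alert",
      [:beam, :process, :count],
      &__MODULE__.handle_event/4,
      nil
    )
  end

  # Logs a warning when the sampled process count crosses the baseline.
  def handle_event(_event, %{count: count}, _meta, _config)
      when count > @process_count_threshold do
    Logger.warning("BEAM process count #{count} exceeds baseline #{@process_count_threshold}")
  end

  def handle_event(_event, _measurements, _meta, _config), do: :ok
end
```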