A comprehensive, production-grade educational platform for learning OTP (Open Telecom Platform) supervision concepts through hands-on experimentation and real-time visualization.
This project provides an advanced learning environment for OTP concepts, combining theoretical knowledge with practical implementation. Built with Phoenix LiveView, it offers:
- Interactive supervisor management dashboard with real-time visualization
- System-wide monitoring interface with analytics and health metrics
- Advanced operations dashboard for OTP process management and debugging
- Comprehensive documentation and educational resources for OTP concepts
- 🔬 **Interactive Sandbox Environments** - Isolated experimentation without system risk
- 📊 **Real-time Analytics** - Production-grade telemetry and monitoring systems
- 🎯 **Educational Focus** - Progressive complexity with comprehensive documentation
- ⚡ **Live Visualization** - Dynamic supervision tree updates and process management
- 🛠️ **Advanced Tooling** - Message tracing, process introspection, and system analysis
- 🏗️ **Production Patterns** - Real-world OTP design patterns and best practices
```bash
mix setup              # Install dependencies and set up the database
mix phx.server         # Start server → http://localhost:4000
# OR
iex -S mix phx.server  # Start with interactive console
```
- `/supervisors` - Interactive supervisor management dashboard
- `/system` - System-wide analytics and monitoring
- `/api/v1/*` - RESTful API for programmatic access
✅ Completed Features:
- Distributed tooling foundation with mode switching
- Single-node simulator for development without real cluster
- Cluster state manager with real-time topology tracking
- Distributed Arsenal operations (ClusterHealth, NodeInfo, ClusterTopology, ProcessList)
- Comprehensive test suite with proper OTP synchronization
🚧 In Progress:
- Horde-specific Arsenal operations
- Enhanced existing Arsenal operations for distributed support
- Distributed sandbox management system
```bash
# Start the server normally - simulation mode is enabled by default
mix phx.server

# OR with interactive console
iex -S mix phx.server
```
```bash
# Terminal 1: Start primary node
./scripts/start_node1.sh
# This runs: iex --sname superlearner1 --cookie secret_cluster_cookie -S mix phx.server
# Web interface: http://localhost:4000

# Terminal 2: Start secondary node (if needed)
./scripts/start_node2.sh
# This runs: iex --sname superlearner2 --cookie secret_cluster_cookie -S mix
```
```elixir
# Check cluster status
Node.list()  # List connected nodes
Node.self()  # Current node name

# Test distributed tooling
OTPSupervisor.Distributed.ToolManager.get_mode()            # Current mode
OTPSupervisor.Distributed.ToolManager.get_cluster_status()  # Cluster info

# Test cluster state management
OTPSupervisor.Distributed.ClusterStateManager.get_cluster_topology()
OTPSupervisor.Distributed.ClusterStateManager.get_process_distribution()

# Test single-node simulation
OTPSupervisor.Distributed.SingleNodeSimulator.simulation_enabled?()
OTPSupervisor.Distributed.SingleNodeSimulator.enable_simulation(3)
OTPSupervisor.Distributed.SingleNodeSimulator.get_simulated_topology()
```
Example Output from Real Multi-Node Test:
```elixir
iex(superlearner1@U2401)> Node.connect(:"superlearner2@U2401")
[info] Node joined cluster: superlearner2@U2401
true

iex(superlearner1@U2401)> Node.list()
[:superlearner2@U2401]

iex(superlearner1@U2401)> OTPSupervisor.Distributed.ToolManager.get_cluster_status()
%{
  nodes: [:superlearner1@U2401, :superlearner2@U2401],
  mode: :multi_node,
  connected_nodes: [:superlearner2@U2401],
  current_node: :superlearner1@U2401,
  tools: []
}

iex(superlearner1@U2401)> OTPSupervisor.Distributed.ClusterStateManager.get_cluster_topology()
%{
  nodes: [:superlearner1@U2401, :superlearner2@U2401],
  connected_nodes: [:superlearner2@U2401],
  current_node: :superlearner1@U2401,
  total_nodes: 2,
  cluster_name: :superlearner_cluster,
  formation_time: ~U[2025-07-15 01:27:17.152676Z]
}

iex(superlearner1@U2401)> OTPSupervisor.Distributed.ClusterStateManager.get_process_distribution()
%{
  superlearner1@U2401: [#PID<0.0.0>, #PID<0.1.0>, ...],          # 598 processes
  superlearner2@U2401: [#PID<27148.0.0>, #PID<27148.1.0>, ...]   # Similar count
}
```
```bash
# Test cluster health endpoint
curl http://localhost:4000/api/v1/cluster/health

# Test cluster topology
curl http://localhost:4000/api/v1/cluster/topology

# Test node information (replace with actual node name)
curl http://localhost:4000/api/v1/cluster/nodes/superlearner1@U2401/info

# Test distributed process list
curl http://localhost:4000/api/v1/cluster/processes
```
Example JSON Response from Cluster Health:
```json
{
  "data": {
    "overall_status": "healthy",
    "partition_status": "healthy",
    "nodes_total": 2,
    "nodes_healthy": 2,
    "nodes_unhealthy": 0,
    "cluster_uptime": "37s",
    "node_statuses": {
      "superlearner1@U2401": {
        "status": "up",
        "health_score": 100,
        "cpu_status": "normal",
        "memory_status": "normal",
        "issues": []
      },
      "superlearner2@U2401": {
        "status": "up",
        "health_score": 100,
        "cpu_status": "normal",
        "memory_status": "normal",
        "issues": []
      }
    },
    "performance_metrics": {
      "total_processes": 1194,
      "memory_usage": {
        "superlearner1@U2401": {"total": 86914960, "processes": 24422864},
        "superlearner2@U2401": {"total": 86401672, "processes": 24214256}
      }
    },
    "recommendations": []
  },
  "success": true,
  "timestamp": "2025-07-15T01:27:54.227947Z"
}
```
```bash
# Run distributed-specific tests
mix test test/otp_supervisor/distributed/

# Run all tests
mix test

# Check for compiler warnings and dialyzer issues
mix compile --warnings-as-errors
mix dialyzer

# Format code
mix format

# Check Erlang statistics (useful for understanding system behavior)
mix run -e "IO.inspect(:erlang.statistics(:reductions)); IO.inspect(:erlang.statistics(:io)); IO.inspect(:erlang.statistics(:garbage_collection)); IO.inspect(:erlang.statistics(:run_queue))"
```
```bash
# Start with debugging enabled
MIX_ENV=dev LOG_LEVEL=debug ./scripts/start_node1.sh

# Check cluster status
./scripts/cluster_status.sh

# Stop nodes gracefully
./scripts/stop_node1.sh
./scripts/stop_node2.sh

# Kill processes on specific ports (if needed)
./scripts/kill_beam_on_port.sh 4000
```
```
Distributed Development Tooling
├── ToolManager           # Central coordinator for distributed tools
├── SingleNodeSimulator   # Development mode simulation
├── ClusterStateManager   # Real-time cluster topology tracking
└── Arsenal Operations
    ├── ClusterHealth     # GET /api/v1/cluster/health
    ├── ClusterTopology   # GET /api/v1/cluster/topology
    ├── NodeInfo          # GET /api/v1/cluster/nodes/:node/info
    └── ProcessList       # GET /api/v1/cluster/processes
```
- LibCluster: Configured for automatic node discovery
- Horde: Available for distributed process registry and supervision
- Simulation Mode: Automatically enabled in single-node development
- Multi-Node Mode: Automatically detected when nodes connect
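LibCluster discovery is driven by a topology entry in the application config; the sketch below is a hypothetical example assuming the `Cluster.Strategy.Epmd` strategy and the node names used earlier in this README (the project's actual topology may differ):

```elixir
# config/dev.exs - illustrative only; strategy and host list are assumptions
config :libcluster,
  topologies: [
    superlearner: [
      strategy: Cluster.Strategy.Epmd,
      config: [hosts: [:"superlearner1@U2401", :"superlearner2@U2401"]]
    ]
  ]
```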
For comprehensive distributed testing capabilities, see:
- README_DISTRIBUTED_TESTING.md - Professional distributed testing framework
- README_ROBUST_CLUSTER_TESTING.md - WSL-compatible automated cluster management
```bash
# Professional distributed test cluster management
mix cluster.test status   # Check cluster status
mix cluster.test start    # Start test cluster
mix cluster.test health   # Comprehensive health check
mix cluster.test run      # Full automated test cycle
mix cluster.test stop     # Stop cluster

# Run distributed tests
mix test --only real_nodes  # Real multi-node tests
mix test --only simulation  # Simulation-based tests
```
Key Features:
- ✅ Automated Cluster Management - No more manual server scripts
- ✅ Robust Error Handling - Addresses all distributed Erlang startup issues
- ✅ Code Synchronization - Prevents testing against stale code
- ✅ Health Monitoring - Comprehensive cluster validation
- ✅ Professional Tooling - Production-grade CLI interface
```bash
# Server Management
mix phx.server                   # Start Phoenix server
iex -S mix phx.server            # Start with IEx console
mix test                         # Run comprehensive test suite
mix test --cover                 # Run tests with coverage analysis

# Code Quality
mix format                       # Format code
mix compile --warnings-as-errors # Treat warnings as errors
mix dialyzer                     # Static analysis (if configured)

# Database
mix ecto.setup                   # Set up database
mix ecto.migrate                 # Run migrations
```
```elixir
# Supervisor Management
Control.list_supervisors()                       # List all supervisors
Control.get_supervision_tree(:demo_one_for_one)  # Inspect hierarchy

# Process Management
Control.list_all_processes()                     # System process overview
Control.kill_process(pid)                        # Safe process termination
Control.get_process_info(pid)                    # Process introspection

# Sandbox Operations
SandboxManager.create_sandbox("test", TestDemoSupervisor)
SandboxManager.list_sandboxes()                  # Active sandboxes
SandboxManager.destroy_sandbox("test")           # Cleanup

# Analytics
AnalyticsServer.get_restart_history(sup_pid)     # Restart analytics
AnalyticsServer.get_failure_rate(sup_pid)        # Failure rate analysis

# Message Tracing
MessageTracer.trace_messages(pid)                # Start tracing
MessageTracer.get_message_history(pid)           # Retrieve messages
MessageTracer.stop_tracing(pid)                  # Stop tracing
```
```
OtpSupervisor.Application (Root Supervisor)
├── OtpSupervisorWeb.Telemetry   # Web metrics and monitoring
├── Phoenix.PubSub               # Real-time communication
├── Registry (TracerRegistry)    # Message tracer management
├── AnalyticsServer              # Supervisor monitoring system
├── SandboxManager               # Isolated testing environments
├── OtpSupervisorWeb.Endpoint    # Phoenix web endpoint
└── DemoSupervisor               # Educational examples
```
Central API hub providing comprehensive OTP process management:
Key Capabilities:
- Process Discovery: List and classify all system processes
- Supervisor Management: Deep inspection of supervision hierarchies
- Safe Termination: Controlled process killing with proper cleanup
- System Analysis: Process topology mapping and relationship analysis
- Educational Tools: Functions designed for learning OTP patterns
API Highlights:
```elixir
# Process Management
Control.list_supervisors()          # All supervisors with metadata
Control.get_supervision_tree(sup)   # Complete hierarchy inspection
Control.kill_process(pid)           # Safe process termination
Control.get_process_info(pid)       # Comprehensive process details

# System Analysis
Control.list_all_processes()        # System-wide process overview
Control.build_process_graph()       # Visual topology mapping
Control.get_process_state(pid)      # GenServer state inspection
Control.detect_system_anomalies()   # Automated health analysis
```
Production-grade monitoring system for supervisor behavior analysis:
Implementation Details:
- Process Monitoring Approach: Uses PID change detection for restart tracking
- Bounded Storage: Maintains last 1000 events per supervisor
- Real-time Analytics: Failure rates, trend analysis, and health metrics
- Zero External Dependencies: Pure OTP implementation
Key Features:
```elixir
# Supervisor Registration & Monitoring
AnalyticsServer.register_supervisor(pid, name)
AnalyticsServer.establish_baseline(pid)

# Analytics & Reporting
AnalyticsServer.get_restart_history(pid)        # Historical restart data
AnalyticsServer.get_failure_rate(pid, window)   # Time-windowed analysis
AnalyticsServer.get_all_supervisor_stats()      # System-wide statistics

# Synchronization (for testing)
AnalyticsServer.sync(pid)                       # Force immediate scan
```
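The PID-change detection approach mentioned above can be sketched in a few lines. This is a hypothetical illustration built on `Supervisor.which_children/1` (the function and variable names are not the project's):

```elixir
# Hypothetical sketch of PID-change restart detection: snapshot the
# supervisor's children, compare against the previous snapshot, and
# report any child whose PID changed (i.e. was restarted).
def detect_restarts(supervisor, previous_pids) do
  current =
    supervisor
    |> Supervisor.which_children()
    |> Map.new(fn {id, pid, _type, _modules} -> {id, pid} end)

  restarts =
    for {id, pid} <- current,
        old_pid = Map.get(previous_pids, id),
        old_pid != pid,
        do: {id, old_pid, pid}

  # Return detected restarts plus the new snapshot for the next scan.
  {restarts, current}
end
```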
Isolated experimentation platform for safe OTP learning:
Production Patterns:
- Dynamic Supervisor Creation: Unique naming and resource isolation
- ETS-based Fast Lookup: O(1) sandbox metadata access
- Automatic Cleanup: Process monitoring with graceful shutdown
- Resource Management: Proper supervisor lifecycle with unlinking
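The ETS fast-lookup pattern can be sketched as follows; the table and key names here are illustrative, not the project's actual schema:

```elixir
# A named, read-optimized ETS table gives O(1) metadata lookups
# without a round-trip through the manager process.
table = :ets.new(:sandbox_registry, [:named_table, :set, :protected, read_concurrency: true])

:ets.insert(table, {"test", %{supervisor_pid: self(), opts: [strategy: :one_for_one]}})

case :ets.lookup(:sandbox_registry, "test") do
  [{_id, metadata}] -> {:ok, metadata}
  [] -> {:error, :not_found}
end
```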
Sandbox Operations:
```elixir
# Sandbox Lifecycle
SandboxManager.create_sandbox(id, module, opts)  # Create isolated environment
SandboxManager.restart_sandbox(id)               # Full restart with config preservation
SandboxManager.destroy_sandbox(id)               # Graceful cleanup

# Introspection
SandboxManager.list_sandboxes()                  # All active sandboxes
SandboxManager.get_sandbox_info(id)              # Detailed sandbox metadata
SandboxManager.get_sandbox_pid(id)               # Direct PID access
```
Advanced debugging system for OTP message flow analysis:
Technical Implementation:
- Erlang Trace Integration: Low-overhead system tracing
- Registry-based Management: Automatic tracer cleanup
- Pattern Analysis: Message flow statistics and debugging insights
- Educational Focus: Demonstrates proper OTP debugging techniques
Tracing Operations:
```elixir
# Message Tracing
MessageTracer.trace_messages(pid, opts)        # Start comprehensive tracing
MessageTracer.get_message_history(pid)         # Retrieve captured messages
MessageTracer.analyze_message_patterns(msgs)   # Pattern analysis
MessageTracer.stop_tracing(pid)                # Clean shutdown
```
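Under the hood, this style of tracer rests on `:erlang.trace/3`. A minimal standalone sketch (the collector logic is simplified relative to the real module, which manages tracers through a Registry):

```elixir
# Spawn a collector that prints every message the target receives.
target = self()

tracer =
  spawn(fn ->
    loop = fn loop ->
      receive do
        {:trace, ^target, :receive, msg} ->
          IO.inspect(msg, label: "traced")
          loop.(loop)

        :stop ->
          :ok
      end
    end

    loop.(loop)
  end)

# Forward :receive trace events for `target` to the collector,
# then disable tracing when done.
:erlang.trace(target, true, [:receive, {:tracer, tracer}])
send(target, :hello)
:erlang.trace(target, false, [:receive])
send(tracer, :stop)
```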
```
lib/
├── otp_supervisor.ex                    # Main application module
├── otp_supervisor/
│   ├── application.ex                   # OTP application & supervision tree
│   ├── core/                            # Core OTP functionality
│   │   ├── control.ex                   # Central process management API
│   │   ├── analytics_server.ex          # Production-grade monitoring
│   │   ├── sandbox_manager.ex           # Isolated testing environments
│   │   └── message_tracer.ex            # Advanced debugging tools
│   └── sandbox/                         # Educational examples
│       ├── supervisors/
│       │   ├── demo_supervisor.ex       # Main educational supervisor
│       │   └── test_demo_supervisor.ex  # Isolated testing supervisor
│       └── workers/                     # Example worker processes
│           ├── counter.ex               # Stateful GenServer example
│           └── printer.ex               # Message processing example
├── otp_supervisor_web.ex                # Phoenix web module
└── otp_supervisor_web/                  # Web interface
    ├── components/                      # Reusable LiveView components
    ├── controllers/api/v1/              # RESTful API endpoints
    ├── live/                            # LiveView interfaces
    │   ├── supervisor_live.ex           # Supervisor management dashboard
    │   └── system_dashboard_live.ex     # System-wide monitoring
    └── telemetry.ex                     # Web telemetry configuration
```
```
test/
├── otp_supervisor/
│   ├── core/                            # Core functionality tests
│   │   ├── control_test.exs             # Process management API
│   │   ├── analytics_server_test.exs    # Real monitoring system tests
│   │   ├── sandbox_manager_test.exs     # Sandbox isolation tests
│   │   └── message_tracer_test.exs      # Tracing functionality
│   └── sandbox/                         # Worker behavior tests
│       ├── test_demo_supervisor_test.exs
│       └── workers/
├── otp_supervisor_web/                  # Web interface tests
│   ├── live/                            # LiveView integration tests
│   └── controllers/api/v1/              # API endpoint tests
└── support/
    └── supervisor_test_helper.ex        # Advanced test utilities
```
The platform is designed for step-by-step OTP mastery:
- Basic Concepts: Start with supervision tree visualization
- Process Lifecycle: Experiment with process killing and restart behavior
- Advanced Patterns: Explore different supervision strategies
- Debugging Techniques: Use message tracing for system analysis
- Production Practices: Study real-world implementation patterns
One-for-One Strategy (`:one_for_one`):
- Failed child restarted individually
- Other children unaffected
- Most common production pattern

One-for-All Strategy (`:one_for_all`):
- All children restarted when one fails
- Used when children are interdependent
- Ensures consistent system state

Rest-for-One Strategy (`:rest_for_one`):
- Failed child and subsequent children restarted
- Maintains startup order dependencies
- Useful for pipeline architectures
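The three strategies differ only in the `strategy:` option passed to `Supervisor.init/2`. A hypothetical pipeline supervisor (the worker modules are placeholders, not part of this project) illustrating `:rest_for_one`:

```elixir
defmodule MyApp.PipelineSupervisor do
  # Hypothetical example - Producer/Transformer/Consumer are placeholders.
  use Supervisor

  def start_link(arg), do: Supervisor.start_link(__MODULE__, arg, name: __MODULE__)

  @impl true
  def init(_arg) do
    children = [
      MyApp.Producer,
      MyApp.Transformer,
      MyApp.Consumer
    ]

    # :rest_for_one - if Transformer crashes, Consumer is restarted too,
    # preserving startup-order dependencies; Producer is left untouched.
    Supervisor.init(children, strategy: :rest_for_one)
  end
end
```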
Sandbox Environments:
- Isolated testing without system impact
- Concurrent experiment support
- Automatic resource cleanup
- Configuration preservation across restarts
Real-time Visualization:
- Dynamic supervision tree updates
- Process state change monitoring
- Interactive process management
- Live system health metrics
Advanced Analytics:
- Restart pattern analysis
- Failure rate calculations
- System anomaly detection
- Performance trend monitoring
Interactive supervisor management interface:
- Real-time supervision tree visualization
- Process killing with immediate feedback
- Live restart counter updates
- Process detail inspection
- Bulk operations support
System-wide monitoring and analytics:
- Process count and health metrics
- Anomaly detection alerts
- System topology visualization
- Performance analytics
- Export capabilities (JSON/CSV)
Comprehensive RESTful API with Arsenal Operations:
```
# Process Information & Control
GET    /api/v1/processes/:pid/info              # Get comprehensive process information (Arsenal)
DELETE /api/v1/processes/:pid                   # Kill/terminate process (Arsenal)
POST   /api/v1/processes/:pid/message           # Send message to process (Arsenal)
POST   /api/v1/processes/:pid/trace             # Start process tracing (Arsenal)

# Supervisor Management
GET    /api/v1/supervisors                      # List all supervisors with metadata (Arsenal)

# Process Management
GET    /api/v1/processes                        # List all processes (with pagination & filtering)
GET    /api/v1/processes/:pid/state             # Get GenServer internal state
GET    /api/v1/processes/:pid/messages          # Get captured message history
DELETE /api/v1/processes/:pid/trace             # Stop message tracing

# Supervisor Operations
GET    /api/v1/supervisors/:name                # Get supervisor details & children
GET    /api/v1/supervisors/:name/analytics      # Get supervisor performance metrics
PUT    /api/v1/supervisors/:name/strategy       # Change supervision strategy
POST   /api/v1/supervisors/:name/simulate-failure # Simulate failures

# System Analysis
GET    /api/v1/system/health                    # System health metrics & status
GET    /api/v1/system/graph                     # Process relationship graph
GET    /api/v1/system/bottlenecks               # Identify system bottlenecks
GET    /api/v1/system/anomalies                 # Detect system anomalies

# Arsenal Metadata
GET    /api/v1/arsenal/docs                     # OpenAPI documentation for Arsenal operations
GET    /api/v1/arsenal/operations               # List all available Arsenal operations
```
The project maintains production-grade test coverage with multiple testing strategies:
Test Categories:
- Unit Tests: Individual module functionality
- Integration Tests: Cross-module interaction
- LiveView Tests: Web interface behavior
- Property Tests: Edge case validation
- Performance Tests: System scalability
Advanced Test Patterns:
```elixir
# Proper OTP synchronization (no Process.sleep)
test "supervisor restart behavior" do
  pid = start_supervised_child()
  ref = Process.monitor(pid)
  Process.exit(pid, :kill)

  # Wait for actual termination
  receive do
    {:DOWN, ^ref, :process, ^pid, _} -> :ok
  after
    1000 -> flunk("Process did not terminate")
  end

  # Verify restart occurred
  assert Process.whereis(:child) != pid
end
```
- Zero Sleep Dependencies: Uses proper OTP synchronization
- Sandbox Isolation: Prevents test interference
- Resource Cleanup: Automatic test environment reset
- Educational Value: Tests demonstrate OTP best practices
```bash
# Test Execution
mix test                              # Full test suite
mix test --cover                      # Coverage analysis
mix test test/path/specific_test.exs  # Individual test files
mix test --trace                      # Detailed test output
```
```elixir
# config/config.exs
config :otp_supervisor, OTPSupervisor.Core.AnalyticsServer,
  scan_interval: 2000,             # Supervisor scan frequency
  max_events_per_supervisor: 1000, # Event history limit
  enable_detailed_logging: true    # Debug information
```
```elixir
# Sandbox creation with custom options
SandboxManager.create_sandbox("test", TestDemoSupervisor,
  strategy: :one_for_all,          # Supervision strategy
  max_restarts: 5,                 # Restart intensity
  max_seconds: 60,                 # Restart period
  custom_config: %{key: "value"}   # Application-specific config
)
```
```elixir
# Advanced tracing options
MessageTracer.trace_messages(pid,
  message_limit: 1000,             # Maximum captured messages
  pattern_matching: true,          # Enable pattern analysis
  include_system_messages: false,  # Filter system messages
  auto_cleanup: true               # Automatic cleanup on process death
)
```
- Supervision Trees: Hierarchical fault tolerance
- Let-It-Crash Philosophy: Embracing failure for robustness
- Process Linking vs Monitoring: Bidirectional vs unidirectional coupling
- GenServer Patterns: Stateful server process implementation
- Registry Usage: Process discovery and naming strategies
- Application Architecture: Proper OTP application structure
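The linking-versus-monitoring distinction in the list above can be seen in a few lines of standalone code:

```elixir
# Monitoring is unidirectional: the observer just gets a :DOWN message.
{pid, ref} = spawn_monitor(fn -> exit(:boom) end)

receive do
  {:DOWN, ^ref, :process, ^pid, reason} ->
    IO.puts("observed exit: #{inspect(reason)}")
end

# Linking is bidirectional: the exit signal propagates to the linked
# process, which survives here only because it traps exits and
# receives the signal as an ordinary {:EXIT, pid, reason} message.
Process.flag(:trap_exit, true)
linked = spawn_link(fn -> exit(:boom) end)

receive do
  {:EXIT, ^linked, reason} ->
    IO.puts("linked exit: #{inspect(reason)}")
end
```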
- Graceful Degradation: System behavior under failure
- Resource Management: Bounded data structures and cleanup
- Performance Monitoring: Low-overhead system introspection
- Fault Isolation: Preventing cascade failures
- Configuration Management: Runtime reconfiguration patterns
- Foundation: Explore supervision tree visualization
- Experimentation: Kill processes and observe restart behavior
- Analysis: Use message tracing to understand communication
- Implementation: Study source code for pattern understanding
- Practice: Create custom supervisors using learned patterns
- Mastery: Implement production-grade OTP applications
- Minimal Overhead: Analytics system designed for production use
- Bounded Resources: Automatic cleanup prevents memory leaks
- Efficient Monitoring: O(1) lookups with ETS tables
- Scalable Architecture: Handles large supervision trees
- Comprehensive Error Handling: Graceful degradation under all conditions
- Resource Cleanup: Automatic management of system resources
- Configuration Hot-Reload: Runtime configuration updates
- Monitoring Integration: Built-in metrics and health checks
- Elixir: 1.14+ (OTP 24+)
- Phoenix: 1.7+
- PostgreSQL: For session storage and development
- Node.js: For asset compilation
```bash
git clone <repository>
cd otp_supervisor

mix deps.get                 # Install Elixir dependencies
mix ecto.setup               # Set up database
npm install --prefix assets  # Install Node.js dependencies

mix phx.server               # Start development server
```
- Documentation: Comprehensive module and function documentation
- Testing: All new features require test coverage
- Performance: Consider performance impact of new features
- Educational Value: Code should demonstrate OTP best practices
- Fork & Branch: Create feature branches from main
- Test Coverage: Ensure all new code has corresponding tests
- Documentation: Update README and module docs as needed
- Code Review: Submit pull requests for review
- Educational Focus: Maintain the platform's learning objectives
This platform represents a sophisticated, production-ready educational tool for mastering OTP supervision concepts through hands-on experimentation and real-world pattern implementation.