A legendary RubixKube™ internship puzzle.
Modern systems are dominos. One quiet failure can ripple across services and break things far away from the blast point. Your mission: simulate this world, detect what matters, reason to root cause, and explain your thinking.
This is not RubixKube. It’s about showing how you think, code, and reason about reliability. It’s also a proxy for how you think under ambiguity.
A program (CLI or tiny web service) that:
- Simulates a service graph over time (with random glitches)
- Detects anomalies against a threshold
- Traces downstream blast radius accurately
- Prioritizes remediation (root cause first)
- Explains the why with clear, human-readable output
- Records an incident log we can review
Run locally. No cloud costs.
- Services graph: Directed acyclic graph (DAG) by default, but your code must handle accidental cycles gracefully (detect & report).
- Input file: JSON array of services. Each service has:
  - name: string (unique, case-sensitive)
  - depends_on: string[] (zero or more upstream services)
  - health: number (initial value in [0.0, 1.0])
Example (sample/services.json provided):
[
{ "name": "service-A", "depends_on": ["service-B", "service-C"], "health": 0.98 },
{ "name": "service-B", "depends_on": ["service-D"], "health": 0.95 },
{ "name": "service-C", "depends_on": ["service-E", "service-F"], "health": 0.99 },
{ "name": "service-D", "depends_on": [], "health": 0.97 },
{ "name": "service-E", "depends_on": ["service-G"], "health": 0.96 },
{ "name": "service-F", "depends_on": ["service-G", "service-H"], "health": 0.94 },
{ "name": "service-G", "depends_on": [], "health": 0.99 },
{ "name": "service-H", "depends_on": ["service-I"], "health": 0.92 },
{ "name": "service-I", "depends_on": [], "health": 0.97 },
{ "name": "service-J", "depends_on": ["service-B", "service-I"], "health": 0.95 }
]
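A minimal sketch of loading and validating this input, in Python. The function name `load_services` and the choice to raise `ValueError` on bad input are assumptions for illustration, not part of the challenge. Dependencies are checked in a second pass so that services can be listed in any order.

```python
import json

def load_services(text):
    """Parse the services JSON and return (health, depends_on) dicts keyed by name."""
    health, depends_on = {}, {}
    for entry in json.loads(text):
        name = entry["name"]
        if name in health:
            raise ValueError(f"duplicate service name: {name}")
        value = float(entry["health"])
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name}: health {value} is outside [0.0, 1.0]")
        health[name] = value
        depends_on[name] = list(entry.get("depends_on", []))
    # Second pass: a dependency may be declared after the service that uses it.
    for name, deps in depends_on.items():
        for dep in deps:
            if dep not in health:
                raise ValueError(f"{name} depends on unknown service {dep}")
    return health, depends_on

# service-B is listed before its dependency service-D on purpose.
sample = """[
  {"name": "service-B", "depends_on": ["service-D"], "health": 0.95},
  {"name": "service-D", "depends_on": [], "health": 0.97}
]"""
health, depends_on = load_services(sample)
```

Whether an unknown dependency should abort the run or only produce a warning is a design choice worth documenting in your README.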
- Ticks: The system evolves over N discrete ticks (iterations).
- Glitches: At most one random service may “glitch” per tick (unless you implement multi-glitch as a stretch). Glitch drops health by a random ∆ in [0.2, 0.5].
- Propagation (downstream degradation): If an upstream dependency U falls below threshold (default 0.70), each direct dependent D degrades by:
  D.health = max(0, D.health - α * (threshold - U.health))
  where α ∈ [0.5, 1.0] (choose and document your α; keep it constant for a run).
- Recovery (optional): Auto-heal a failed service after a cooldown K ticks with heal_to (e.g., 0.85). Recovery must propagate upstream improvements downstream (reverse blast).
- Determinism: Support --seed <int> so the same inputs produce the same outputs.
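The glitch, propagation, and determinism rules above can be sketched in Python as follows. Several choices here are assumptions rather than requirements: the names `propagate`, `tick`, and `simulate`; the hypothetical `glitch_prob` parameter (the rules only say "at most one" glitch per tick); and the decision to apply the penalty once, on the tick a service crosses the threshold, cascading within that same tick. Other readings of the rule are possible, so document whichever you pick.

```python
import random

def propagate(health, dependents, newly_failed, threshold=0.70, alpha=0.8):
    """Apply the degradation formula to direct dependents, cascading within the tick."""
    queue = list(newly_failed)
    while queue:
        upstream = queue.pop(0)
        penalty = alpha * max(0.0, threshold - health[upstream])
        for dep in dependents.get(upstream, []):
            was_ok = health[dep] >= threshold
            health[dep] = max(0.0, health[dep] - penalty)
            if was_ok and health[dep] < threshold:
                queue.append(dep)  # dep just failed, so its own dependents are hit next

def tick(health, dependents, rng, threshold=0.70, alpha=0.8, glitch_prob=0.5):
    """One tick: glitch at most one service, then propagate if it crossed the threshold."""
    if rng.random() >= glitch_prob:
        return None
    victim = rng.choice(sorted(health))  # sorted so the candidate order is stable
    was_ok = health[victim] >= threshold
    health[victim] = max(0.0, health[victim] - rng.uniform(0.2, 0.5))
    if was_ok and health[victim] < threshold:
        propagate(health, dependents, [victim], threshold, alpha)
    return victim

def simulate(initial, dependents, ticks, seed):
    """Run a whole simulation from a private RNG so the seed fully determines it."""
    rng = random.Random(seed)
    health = dict(initial)
    for _ in range(ticks):
        tick(health, dependents, rng)
    return health

# Worked example of the formula: service-G has already dropped to 0.50.
dependents = {"service-G": ["service-E", "service-F"], "service-E": ["service-C"]}
health = {"service-G": 0.50, "service-E": 0.96, "service-F": 0.94, "service-C": 0.99}
propagate(health, dependents, ["service-G"])
# Penalty is 0.8 * (0.70 - 0.50) = 0.16, so E drops to about 0.80 and F to about 0.78.
# Neither crosses 0.70, so service-C is untouched.
```

Because each service can cross from healthy to failed only once per call, the cascade terminates even if the graph contains a cycle.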
You can externalize parameters in config.yaml:
ticks: 50
threshold: 0.70
alpha: 0.8
cooldown: 1
heal_to: 0.88
seed: 1337
- Anomaly: A service is FAILED when health < threshold.
- Blast radius: All downstream nodes reachable from a failed node (via reverse-topology).
- Root cause(s):
  - Prefer nodes with no failed upstream (i.e., failure originates there).
  - For multiple roots, order by blast size (more impacted first), then by topological depth (closer to the graph roots first).
- Cycles: If cycles are present, don’t crash. Detect and print a warning:
  [WARN] Cycle detected: service-X -> service-Y -> service-X (RCA may be approximate)
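One way to turn these definitions into code, sketched in Python against the sample graph. The helper names are hypothetical, and two readings are assumed: "no failed upstream" is checked against direct dependencies only, and "topological depth" is taken as the longest dependency chain below a service, with dependency-free services at depth 0. Ties are broken by name so the ordering stays stable.

```python
from collections import deque

def build_dependents(depends_on):
    """Reverse adjacency: upstream name -> services that depend on it."""
    dependents = {name: [] for name in depends_on}
    for name, deps in depends_on.items():
        for dep in deps:
            dependents[dep].append(name)
    return dependents

def blast_radius(start, dependents):
    """Every service reachable downstream of `start` (BFS, so cycles cannot loop forever)."""
    seen, queue = set(), deque([start])
    while queue:
        for dep in dependents[queue.popleft()]:
            if dep != start and dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return seen

def depth(name, depends_on, memo, stack=()):
    """Longest dependency chain below `name`; 0 when it has no dependencies."""
    if name in memo:
        return memo[name]
    if name in stack:  # cycle: stop descending, the result is approximate
        return 0
    below = (depth(d, depends_on, memo, stack + (name,)) for d in depends_on[name])
    memo[name] = 1 + max(below, default=-1)
    return memo[name]

def root_causes(health, depends_on, threshold=0.70):
    """Failed services with no failed direct upstream, most impactful first."""
    failed = {n for n, h in health.items() if h < threshold}
    dependents = build_dependents(depends_on)
    roots = [n for n in failed if not any(d in failed for d in depends_on[n])]
    memo = {}
    return sorted(roots, key=lambda n: (-len(blast_radius(n, dependents)),
                                        depth(n, depends_on, memo), n))

depends_on = {
    "service-A": ["service-B", "service-C"], "service-B": ["service-D"],
    "service-C": ["service-E", "service-F"], "service-D": [],
    "service-E": ["service-G"], "service-F": ["service-G", "service-H"],
    "service-G": [], "service-H": ["service-I"], "service-I": [],
    "service-J": ["service-B", "service-I"],
}
health = {name: 0.95 for name in depends_on}
health.update({"service-G": 0.62, "service-E": 0.60, "service-D": 0.50})
ranked = root_causes(health, depends_on)
```

Here service-E is failed but sits behind the failed service-G, so it is not reported as a root. service-G ranks ahead of service-D because four services sit downstream of it versus three.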
Your tool should support:
- run: ./domino --input sample/services.json --config config.yaml
- query: ./domino "why is service-A failing?"
- help: ./domino --help
Query semantics (examples):
- why is service-A failing? → explain which upstream failed, when, and the chain.
- what happened in the last 10 ticks? → summarize incidents.
- top-impacted → list services by cumulative degradation.
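A small sketch of how the command line and the three example queries could be parsed with Python's standard `argparse` and `re` modules. The function names, the returned tuple shape, and the query kinds (`"why"`, `"recent"`, `"top-impacted"`, `"unknown"`) are assumptions for illustration; only the flags and query phrasings come from the examples above.

```python
import argparse
import re

def build_parser():
    """CLI accepting either run flags or a single free-text query."""
    parser = argparse.ArgumentParser(prog="domino")
    parser.add_argument("query", nargs="?", help='e.g. "why is service-A failing?"')
    parser.add_argument("--input", help="path to the services JSON file")
    parser.add_argument("--config", help="path to the YAML config file")
    parser.add_argument("--seed", type=int, help="seed for reproducible runs")
    return parser

def parse_query(text):
    """Map a free-text query to a (kind, argument) pair."""
    text = text.strip()
    match = re.fullmatch(r"why is (\S+) failing\??", text, re.IGNORECASE)
    if match:
        return ("why", match.group(1))  # service names stay case-sensitive
    match = re.fullmatch(r"what happened in the last (\d+) ticks\??", text, re.IGNORECASE)
    if match:
        return ("recent", int(match.group(1)))
    if text.lower() == "top-impacted":
        return ("top-impacted", None)
    return ("unknown", text)

args = build_parser().parse_args(["why is service-A failing?"])
kind, target = parse_query(args.query)
```

Returning an explicit "unknown" kind lets the tool print a helpful message instead of crashing on queries it does not understand.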
If you build a web UI instead, document the endpoints (OpenAPI optional).
Write a human-readable incident log to ./sample/output.log (example already provided), and print key events to stdout. Suggested lines:
[ALERT] service-G fell below threshold (0.62 < 0.70) at T=2
[BLAST] due to service-G → impacted: [service-E, service-F, service-C]
[PRIORITY] roots={service-G} order=[service-G]
[SUGGESTION] Remediate service-G first
[HEAL] service-G -> 0.88 at T=3; recovered: [service-E, service-F, service-C]
Machine-readable (optional): also emit events.jsonl with structured entries.
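As an illustration, the same event can feed both the human-readable log and events.jsonl. This Python sketch uses hypothetical helper names and hypothetical JSON field names (`type`, `service`, `health`, `threshold`, `tick`); the challenge does not prescribe a schema.

```python
import io
import json

def alert_line(service, health, threshold, tick):
    """Format an alert in the style of the suggested log lines."""
    return f"[ALERT] {service} fell below threshold ({health:.2f} < {threshold:.2f}) at T={tick}"

def write_event(stream, **fields):
    """Append one JSON object per line (the JSON Lines convention)."""
    stream.write(json.dumps(fields, sort_keys=True) + "\n")

# An in-memory buffer stands in for the events.jsonl file here.
buffer = io.StringIO()
write_event(buffer, type="ALERT", service="service-G", health=0.62, threshold=0.70, tick=2)
line = alert_line("service-G", 0.62, 0.70, 2)
```

Sorting the keys keeps the output byte-for-byte identical across runs with the same seed, which makes diffs between runs meaningful.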
- Correct graph handling (including out-of-order nodes, isolated nodes, cycles)
- Determinism with --seed
- Clear RCA (not just “who failed,” but why and who was hit)
- Readable code (modular, small functions, tests welcome)
- Great README (how to run, assumptions, trade-offs)
- Reasoning & RCA (35): root-cause logic, blast accuracy, edge cases
- Code quality (25): structure, naming, tests, docs
- Practicality (20): easy setup, deterministic, handles real-ish data
- Clarity of output (10): logs explain the story; a non-engineer can follow
- Creativity (10): visualization, CLI queries, auto-heal, metrics, etc.
We don’t punish incomplete work, but we do reward thoughtful work.
- No external paid services or cloud infra
- No massive frameworks for a tiny CLI
- Don’t hide logic in black boxes—show your thinking
- Don’t copy/paste someone else’s solution (we can tell)
This challenge is open all the time — no fixed deadline. Use the 7-day window as your personal measure of fairness: whenever you start, try to finish within 7 days and be honest about it in your submission. If you crack it now, great. If not, learn from it and come back later. Our doors are always open for builders who love hard problems.
- Go to Issues → New issue
- Choose the “Domino Effect Submission” link
- Fill the form (public repo URL required) and submit
A GitHub Action will clone your repo, run basic checks, and comment results.
Prefer a private submission? Send details to connect@rubixkube.ai.
- Keep your code under /src (or language-standard layout)
- Put configs in /config (optional)
- Put generated logs in /runs/<timestamp>/
We’ve included:
- sample/services.json (input)
- sample/output.log (example output)
- .github/ISSUE_TEMPLATE/submission.yml (issue form)
- .github/workflows/validate_submission.yml (basic checker)
- Topo-sort for propagation; fall back to Kahn’s algorithm with cycle detection
- Precompute reverse adjacency for blast tracing
- Keep an event_bus abstraction so you can log & query easily
- Treat health as float with clamping [0,1]; avoid negative spirals
- For multi-root RCA, think minimum cut intuition (but keep it simple)
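The first hint can be sketched in Python like this: Kahn's algorithm yields a dependency-first order, and any services it cannot place are on a cycle or depend on one. The function names are hypothetical, and the arrows in the warning are assumed to follow the depends_on direction (X -> Y meaning X depends on Y).

```python
from collections import deque

def topo_order(depends_on):
    """Kahn's algorithm. Returns (order, leftover); leftover is non-empty iff a cycle exists."""
    pending = {name: len(set(deps)) for name, deps in depends_on.items()}
    dependents = {name: [] for name in depends_on}
    for name in sorted(depends_on):
        for dep in sorted(set(depends_on[name])):
            dependents[dep].append(name)
    queue = deque(sorted(n for n, count in pending.items() if count == 0))
    order = []
    while queue:
        name = queue.popleft()
        order.append(name)
        for child in dependents[name]:
            pending[child] -= 1
            if pending[child] == 0:
                queue.append(child)
    return order, set(depends_on) - set(order)

def find_cycle(depends_on, leftover):
    """Follow unresolved dependencies until a service repeats; return that loop as a path."""
    node, path = min(leftover), []
    while node not in path:
        path.append(node)
        # Every leftover service has at least one leftover dependency.
        node = min(d for d in depends_on[node] if d in leftover)
    return path[path.index(node):] + [node]

graph = {"service-X": ["service-Y"], "service-Y": ["service-X"], "service-Z": []}
order, leftover = topo_order(graph)
if leftover:
    cycle = find_cycle(graph, leftover)
    warning = f"[WARN] Cycle detected: {' -> '.join(cycle)} (RCA may be approximate)"
```

Sorting names before queuing them keeps the order identical between runs, which matters for the determinism requirement.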
If your solution shows strong reasoning, clarity, and care, that’s RubixKube DNA. If AI builds products, you help keep them alive.
Build it like you mean it. 🚀
Good luck. May the dominos fall in your favor.