GomJabbar is a service inspired by Netflix's ChaosMonkey, but unlike ChaosMonkey, it was designed to work with your private cloud infrastructure (i.e. your own data centers).
The service exposes endpoints that allow you to randomly select targets, trigger a selected fault, and revert when needed.
You can find the Netflix explanation here. No point in copying that over ;)
The main idea is to reduce our fear of production (fear is the mind killer, remember?). If you want to learn how to improve your code, monitoring, and alerting, and how to deal with production issues while you're awake and ready, this is the tool for you.
After running several chaos drills at Outbrain, I can assure you that doing this on a regular basis is extremely valuable. During a midnight page, most people will not fix anything or investigate very far, and the incident will usually end with a service restart. During a chaos drill, we look deeper into the root causes and try to learn what we need to fix and where we need to improve. After every drill we conduct a quick take-in and implement the fixes as soon as possible.
Running GomJabbar helps us validate our assumptions, our infrastructure, our resilience, and our fixes.
GomJabbar supports an extensible fault injection mechanism, along with configuration-based commands and scripts for triggering faults. The example config file contains examples ranging from harmless failures to graceful / graceless shutdowns and traffic control (emulating network issues).
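To make the idea of configuration-based faults a bit more concrete, here is a minimal sketch of how a fault with a trigger command and a revert command could be modeled. This is not GomJabbar's actual config schema or code: the Fault class, the field names, the {host} / {module} placeholders, and the sample commands are all assumptions made up for illustration. The sketch only builds command strings and never executes anything.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Fault:
    """A fault described purely by configuration (hypothetical schema)."""
    name: str
    fail_command: str    # command template that injects the fault
    revert_command: str  # command template that undoes it


# Hypothetical catalog, ordered from harmless to disruptive.
FAULTS = [
    Fault("harmless",
          "echo 'drill on {host}'",
          "echo 'drill over on {host}'"),
    Fault("graceful_shutdown",
          "ssh {host} sudo service {module} stop",
          "ssh {host} sudo service {module} start"),
    Fault("network_latency",
          "ssh {host} sudo tc qdisc add dev eth0 root netem delay 300ms",
          "ssh {host} sudo tc qdisc del dev eth0 root netem"),
]


def select_fault(faults, rng):
    """Pick one fault at random; rng is a random.Random instance."""
    return rng.choice(faults)


def render(fault, host, module):
    """Return (fail, revert) command strings for one target.

    Nothing is executed here; the caller decides how to run them.
    """
    params = {"host": host, "module": module}
    return (fault.fail_command.format(**params),
            fault.revert_command.format(**params))


if __name__ == "__main__":
    fault = select_fault(FAULTS, random.Random())
    fail, revert = render(fault, "web-01", "nginx")
    print(fault.name, "|", fail, "|", revert)
```

Keeping the revert command next to the trigger command in the same definition means every injected fault carries its own way back, which is what makes a "revert when needed" step possible.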
We currently integrate with Consul out of the box, and provide configuration-based filtering for the targets. Future versions will integrate with other service discovery methods, and the tool was designed to easily support this.
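As a rough illustration of configuration-based target filtering, the sketch below narrows a list of discovered instances with include / exclude rules and then picks one at random. It is only a sketch under stated assumptions: the Target and Filters classes, their field names, and the sample hosts are hypothetical, and the target list is hard-coded instead of being fetched from a service discovery backend.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Target:
    """One discovered service instance (hypothetical shape)."""
    host: str
    module: str                     # service name as registered in discovery
    cluster: str                    # e.g. a data center name
    tags: frozenset = frozenset()


@dataclass(frozen=True)
class Filters:
    """Include / exclude rules; an empty include set means 'allow all'."""
    include_clusters: frozenset = frozenset()
    exclude_modules: frozenset = frozenset()
    exclude_tags: frozenset = frozenset()

    def allows(self, target):
        if self.include_clusters and target.cluster not in self.include_clusters:
            return False
        if target.module in self.exclude_modules:
            return False
        if self.exclude_tags & target.tags:
            return False
        return True


def candidates(targets, filters):
    """Return the targets that pass every filter, in their original order."""
    return [t for t in targets if filters.allows(t)]


def select_target(targets, filters, rng):
    """Pick one allowed target at random, or None if nothing qualifies."""
    allowed = candidates(targets, filters)
    return rng.choice(allowed) if allowed else None


if __name__ == "__main__":
    discovered = [
        Target("web-01", "frontend", "dc1"),
        Target("db-01", "mysql", "dc1"),
        Target("web-02", "frontend", "dc2", frozenset({"do-not-touch"})),
    ]
    rules = Filters(exclude_modules=frozenset({"mysql"}),
                    exclude_tags=frozenset({"do-not-touch"}))
    print(select_target(discovered, rules, random.Random()))
```

Because the filtering works on plain Target objects, the same rules would apply no matter which discovery backend produced the list.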
GomJabbar now integrates with Rundeck and Ansible. Future versions may support other automation tools, or built-in SSH capabilities / agents.