Our application was down for 23 minutes in a single day, but our SLA allows only 15 minutes of downtime per year.
Prevent downtime, or significantly reduce it in the areas where it can't be prevented, by improving our incident response process.
The URL shortener application was down because of a recent update made by a new hire. The application error caused the server to return a 500 Internal Server Error, which made the website unavailable.
A new hire committed version 2 of our application to the main branch, and that version used a JSON method incorrectly.
json.loads(urls_file)  # Wrong method (v2)
json.load(urls_file)   # Correct method (v1)
Below, we can see that the code is nearly identical apart from a single extra letter (.loads vs .load) in version 2.
Wrong JSON Method (v2) Correct JSON Method (v1)
Because json.loads() expects a string rather than a file object, it can't work with a file. Our application can't complete this task, so our server responds with an error that made our application unavailable.
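The difference can be reproduced in a few lines. This is a minimal sketch, not our application code: the file name `urls.json` and its contents are made up for illustration.

```python
import json
import os
import tempfile

# Hypothetical stand-in for the application's URL file.
path = os.path.join(tempfile.mkdtemp(), "urls.json")
with open(path, "w") as f:
    json.dump({"abc": "https://example.com"}, f)

# json.load() reads from a file object: this is the v1 behaviour.
with open(path) as urls_file:
    urls = json.load(urls_file)

# json.loads() expects str, bytes or bytearray, so passing a file
# object raises TypeError: this is the v2 bug.
with open(path) as urls_file:
    try:
        json.loads(urls_file)
        error = None
    except TypeError as exc:
        error = exc
```

If the exception is not handled, the request fails, which is consistent with the 500 response we saw.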
eb logs | grep -i -C 5 "error" > error_hunt.txt
The output from that command had overlapping, repetitive text. We first tried to remove the duplicates with the "awk" command, but we ended up using the "sort" command because we could reason our way to the result faster than we could research "awk".
eb logs | grep -C 3 'error' | nl -w3 -s':' | sort -u -k2,2 | sort -n -k1,1 > error_hunt_filtered.txt
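The same filter-then-deduplicate idea can also be sketched in Python. This is an illustrative alternative, not the command we ran: the function name `error_context` and its parameters are hypothetical, and unlike the pipeline above it matches case-insensitively and keeps the first occurrence of each repeated line in its original position.

```python
def error_context(lines, needle="error", context=3):
    """Return lines within `context` lines of a match, without duplicates.

    Matching is case-insensitive. Output keeps the original log order.
    """
    needle = needle.lower()
    keep = set()
    for i, line in enumerate(lines):
        if needle in line.lower():
            start = max(0, i - context)
            stop = min(len(lines), i + context + 1)
            keep.update(range(start, stop))

    seen = set()
    result = []
    for i in sorted(keep):
        text = lines[i].rstrip("\n")
        if text not in seen:  # drop repeated lines from overlapping windows
            seen.add(text)
            result.append(text)
    return result
```

It could be fed the saved log output, for example with `error_context(open("error_hunt.txt").readlines())`.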
We spotted some lines mentioning errors related to JSON processing. We searched for json.loads() in our application.py file.
We couldn't find anything that stood out in the current application.py file, so we searched the Git history using...
git log -p
This is where we found that version 1 of application.py used json.load() but version 2 used json.loads().
Version 2 Version 1
THIRD: Generate 4 strategies to help us prevent downtime, or significantly reduce downtime in the areas where it can't be prevented...
6) We implement 2 preventive fixes to our CI/CD pipeline. This will prevent any similar incident from causing any amount of downtime.
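As an illustration only (not necessarily one of our two fixes), a pipeline step of this kind could run a smoke check against a deployed test environment before promoting a release. The function name `is_healthy` and the idea of a separate staging URL are assumptions made for this sketch.

```python
import urllib.error
import urllib.request


def is_healthy(url, timeout=5):
    """Return True only if a GET request to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # HTTP error statuses such as 500 raise HTTPError, a subclass of
        # URLError. Connection failures and timeouts also end up here.
        return False
```

A pipeline script could call this with the staging address and exit with a non-zero status when it returns False, so that a release answering with a 500 never reaches the main environment.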














