Collating information about outages for Incident Reports #2791

balajialg · 2021-09-23T00:42:47Z

All,

I wanted to collate all the information about the outages across the various hubs in a single place. I see this being useful in several ways:

Help us write consolidated incident reports in the future
Verify @felder's GCP queries
Evaluate whether an outage is due to an issue that we already fixed

Date	Hubs/Services Affected by the Outage	User Impact	Reasons
August 20th, 2021	Datahub RStudio, dlab.datahub RStudio	300+ students as part of the R workshop	Due to this issue (#2585), this PR was created. Jupyter Client went through a major upgrade which broke the system.
August 26th, 2021 (First day of class)	R hub, Datahub	Stats, Econ students were not able to log into their hubs	Due to this issue (#2628), this PR (#2629) was created. Related to blocking request for course scope through the canvas.
September 2nd, 2021	Data 100	around 10+ students	This issue (#2688) was due to the addition of the voila package
September 13th, 2021	Prob 140	No Data on the impact of the outage	Check the PR that fixed this issue here (#2749)! The size of the DB was full due to logs.
September 16th, 2021	Data 100	50+ students reported issues with their Hub instance	Hub restarted with a delay after a PR(#2768) got merged resulting in an interim outage for users
September 29th, 2021	EECS Hub	All students in EECS 16A lab reported memory-related issues with their Hub instance	NFS disk was full resulting in this error. Issue description and solution can be found in this issue (#2808)
October 19th, 2021	R Hub	All users in the R Hub	A storage problem with the hub resulted in this error. Issue description and solution can be found in this issue (#2902)
January 20th, 2022	Many hubs	Most GSIs across multiple hubs	For more information, refer here
February 2nd, 2022	Data 100 hub	Minor outage for a few students	PR merge to prod triggered the pods to be knocked out of the hub
August 8, 2022	All hubs	Outage that affected all hubs including Data 8 students	The outage was caused by a core node issue, which @yuvipanda fixed by killing the core node
August 23, 2022	Data 100 hub	Outage that affected some Data 100 instructors and students
Sep 5, 2022	Data 100 hub	Outage that affected a few Data 100 students
Sep 11, 2022	Data 100 hub, Biology Hub	Outage that affected all hubs
Sep 12, 2022	Stat 20 hub, R Hub	Outage that affected students using R Hub	Issue details are in #3740
Sep 14, 2022	Data 102 Hub	Outage that affected few students
Sep 15, 2022	Stat 20 hub, Data 100 Hub, R Hub	Outage that affected all the hubs
Sep 18, 2022	Data 100 Hub	Outage that affected all the hubs due to NFS server issue	NFS restart brought hubs back
Oct 7, 2022	All hubs	Hubs down due to NFS server issue which affected all users for a short period of time	Yuvi restarted the NFS server which brought the hubs back
Oct 9, 2022	All hubs	Hubs down due to NFS server issue which affected all users for a short period of time	Yuvi restarted the NFS server which brought the hubs back
Oct 10, 2022	All hubs	PR moving all hubs to NFS v3.0 from v4.0 resulted in a crash that affected all users for a short period of time	Reverted back to the original state
Oct 12, 2022	Stat 20 hub	Start-up times went really high for the Stat 20 hub. #3836 is tracking this	Yuvi moved Stat 20 to a different node pool altogether
Oct 15, 2022	Data 101 hub	Users reported 403 error	A process to delete inactive users resulted in a race condition. Yuvi deleted the process, which brought the hub back
Oct 27, 2022	Data 8 Hub	Data 8 Hub users not able to access their pods	Not able to recollect the reasons for this outage
Oct 30, 2022	All hubs	Hubs were unusable for a short duration of time	Yuvi drained the nodes which had all the affected pods
Nov 11, 2022	Data 8, Data 100, and Data 101 Hubs	Hubs were unusable for most users for a short duration of time	Outage due to a node autoscaler issue, which is highlighted in issue #3935
Dec 2, 2022	All hubs were down	Hubs were unavailable for all users for a period of 2 hours	Outage due to nginx related issue
Feb 24, 2023	All hubs were down	Hubs were unavailable for all users for a period of 30 mins - particularly disruptive for Data 8 hub	Outage due to nginx related issue
Sep 30, 2023	All hubs were down	Hubs were unavailable for all users for a period of 40 mins	Outage due to tcp OOM/nginx related issue
Dec 4, 2023	All hubs were down for 10-12 mins	Hubs were unavailable for all users for 15 mins	Outage due to tcp memory related issue
Dec 5, 2023	All hubs were down for 35 mins	Hubs were unavailable for all users from 11:10 - 11:40 PM	Outage due to tcp memory related issue
Feb 7 and 21, 2024	Users were getting "white screen" issue when they tried to log into Datahub	Datahub, Data 100, Data 8, Prob 140 users were getting this error message when they log into Datahub. Clearing cache, restarting server, incognito window, using another browser are the available options	There is no clarity around the reason for the issue. Piloting fork of CHP is considered a possibility but there is no definitive evidence around the root cause for the issue.
Feb 23, 2024	All hubs were affected	Core node restart caused intermittent outage 5 times between 8.30 - 9 PM.	Core node was being autoscaled down from 1 --> 0, which had the effect of killing and restarting ALL hub pods. @shaneknapp disabled autoscaling in the `core` node pool and pinned the node pool size to 1. Since then, we haven't had this issue again.
April 5, 2024	Multiple hubs such as Data 8, 100, 101	Users could not pull notebooks from GitHub repositories	The JupyterHub upgrade to 4.1.4 and the nbgitpuller upgrade to the latest version (1.2.1) broke nbgitpuller functionality. Reverting the JupyterHub 4.1.4 bump and downgrading nbgitpuller to 1.1.0 fixed the issue for users whose nbgitpuller links were failing.
Sep 10, 11 and 12, 2024	Certain percentage of users across all hubs	CHP CPU spike resulted in hub restart causing a brief downtime (~5 to 10 mins)	Increased CHP memory to 3 GB and filed an upstream issue with the configurable-http-proxy maintainers to track this issue
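
To help with the third goal above (checking whether a new outage matches a cause we have already seen), the rows could be tagged with a coarse cause category and counted. The following is only a minimal sketch, assuming the table is available as tab-separated text. The category names, the keyword lists, and the function names `classify` and `tally_causes` are illustrative assumptions, not an agreed taxonomy or existing tooling, and the sample rows are abbreviated from the table.

```python
from collections import Counter

# Hypothetical category -> keywords mapping (an assumption for illustration).
# The first category with a matching keyword wins.
CAUSE_KEYWORDS = {
    "file server": ("nfs",),
    "core node": ("core node",),
    "ingress / tcp memory": ("nginx", "tcp"),
    "proxy (CHP)": ("chp",),
}


def classify(text: str) -> str:
    """Return the first category whose keyword appears in the text."""
    lowered = text.lower()
    for category, keywords in CAUSE_KEYWORDS.items():
        if any(keyword in lowered for keyword in keywords):
            return category
    return "uncategorized"


def tally_causes(tsv: str) -> Counter:
    """Count outages per category from tab-separated rows (date first)."""
    counts = Counter()
    for line in tsv.strip().splitlines():
        cells = line.split("\t")
        # Columns are not filled in consistently, so search every cell
        # after the date rather than only the "Reasons" column.
        counts[classify(" ".join(cells[1:]))] += 1
    return counts


# Rows abbreviated from the table above.
SAMPLE_ROWS = "\n".join([
    "Oct 7, 2022\tAll hubs\tHubs down due to NFS server issue\tYuvi restarted the NFS server",
    "Dec 2, 2022\tAll hubs were down\tUnavailable for 2 hours\tOutage due to nginx related issue",
    "Dec 4, 2023\tAll hubs were down\tUnavailable for 15 mins\tOutage due to tcp memory related issue",
    "Oct 27, 2022\tData 8 Hub\tUsers not able to access their pods\tNot able to recollect the reasons",
])

if __name__ == "__main__":
    for cause, count in tally_causes(SAMPLE_ROWS).most_common():
        print(f"{cause}: {count}")
```

Keyword matching is crude and will misfile rows whose reason is worded differently, so the "uncategorized" bucket is a prompt for a human to look at the row rather than a final answer.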


balajialg · 2022-12-01T01:43:33Z

@ryanlovett You had ideas about combining this issue with pre-existing after-action reports. Do you think #3539 is a viable next step, or did you have something else in mind?

ryanlovett · 2022-12-01T17:06:37Z

@balajialg I think every incident should be followed up by a blameless incident report, https://github.com/berkeley-dsep-infra/datahub/blob/staging/docs/admins/incidents/index.rst. Perhaps when there are outages, you can create a github issue which tracks the creation of the incident report and assign it to the admin with the most insight into it.

The reports should follow a template with a summary, timeline, and action items to prevent the issue from recurring. They should be published in the docs.
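
As a rough illustration of such a template, the sketch below renders a report skeleton with the three sections named above (summary, timeline, action items) as reStructuredText, on the assumption that reports live alongside the linked `index.rst`. The `IncidentReport` class, its field names, and `render_rst` are hypothetical and do not reflect any existing tooling in the repo. The example values are taken from the Feb 23, 2024 row of the table in this issue.

```python
from dataclasses import dataclass, field


@dataclass
class IncidentReport:
    """Hypothetical container for the three sections of a report."""
    title: str
    date: str
    summary: str
    timeline: list[tuple[str, str]] = field(default_factory=list)  # (when, what)
    action_items: list[str] = field(default_factory=list)


def _heading(text: str, underline: str) -> str:
    """Return an RST heading: the text, then an underline of equal length."""
    return f"{text}\n{underline * len(text)}\n"


def render_rst(report: IncidentReport) -> str:
    """Render the report as reStructuredText, with TODO markers for gaps."""
    lines = [_heading(f"{report.date}: {report.title}", "=")]
    lines += [_heading("Summary", "-"), report.summary or "TODO", ""]
    lines.append(_heading("Timeline", "-"))
    lines += [f"* {when}: {what}" for when, what in report.timeline] or ["* TODO"]
    lines.append("")
    lines.append(_heading("Action items", "-"))
    lines += [f"#. {item}" for item in report.action_items] or ["#. TODO"]
    return "\n".join(lines) + "\n"


# Example values drawn from the Feb 23, 2024 row of the outage table.
EXAMPLE = IncidentReport(
    title="Core node autoscaling restarted all hub pods",
    date="2024-02-23",
    summary="Core node was being autoscaled down from 1 to 0, "
            "killing and restarting all hub pods.",
    timeline=[("8:30 - 9 PM", "Intermittent outage, 5 times")],
    action_items=["Disable autoscaling in the core node pool and pin its size to 1"],
)

if __name__ == "__main__":
    print(render_rst(EXAMPLE))
```

Leaving empty sections as `TODO` rather than omitting them is a deliberate choice in this sketch: a skeleton can be created as soon as the tracking issue is opened, while memories are fresh, and filled in by whichever admin has the most insight.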

balajialg · 2022-12-01T20:50:15Z

@ryanlovett For the future, I will create an incident report template that any of the admins with insight can fill in. That would make it easy to start filling in an AAR when an outage happens.

However, what about the outages reported during fall 22? Do we want to create one incident report that collectively summarizes learnings and scopes the next steps? I am not sure whether writing an individual AAR for each outage is possible given the scope of work required.

Possibly this can be a discussion item for the Monthly Sprint Planning meeting.

ryanlovett · 2022-12-01T21:05:06Z

Ideally each incident would have a separate report since there are often different factors. This semester there were outages due to core nodes, image pulling delays, and the file server. The problem with creating reports too far after the fact is that our memories are hazy.

Are AARs and incident reports the same thing? Our previous incident reports contained an "action item" section which sounds similar to "After Action" reports. Is an AAR part of an RTL protocol? Wherever the action items are placed, it'd be good if they're found in a single place.

balajialg · 2022-12-01T23:15:32Z

@ryanlovett Apologies for using AAR and incident reports interchangeably while meaning the same. There is an RTL protocol for sharing a detailed outage template with relevant information to the leadership. However, that is more about the logistics of resolving the outage. It doesn't focus on the technical specifics of the incident report.

If @shaneknapp has the bandwidth and is interested then we can publish an incident report for fall 22 which outlines the issues due to a) core nodes, b) file server, and c) image pulling delays and the steps we took in the near term to resolve the outage and the plans we have for the long term to eliminate the reasons for such outages.

balajialg · 2023-01-24T21:01:26Z

Review the data once again!

This was referenced Sep 24, 2021

Automatically watch for errors in our infrastructure logs #2693

Closed

Weekly Update - Monday, September 27th #2799

Closed

balajialg self-assigned this Sep 30, 2021

balajialg added the documentation Issues around adding and modifying docs label Sep 30, 2021

balajialg changed the title ~~[For Incident Reports] Collating information about outages which happened during the past two months~~ [For Incident Reports] Collating information about outages Jul 25, 2022

balajialg changed the title ~~[For Incident Reports] Collating information about outages~~ Collating information about outages for After Action Reports Nov 28, 2022

balajialg mentioned this issue Nov 28, 2022

Consolidated After Action Report for all outages during Fall 22 Semester #3539

Closed

2 tasks

balajialg changed the title ~~Collating information about outages for After Action Reports~~ Collating information about outages for Incident Reports Dec 3, 2022
