Site Partial Outage Summary (~5 hours) #385

Closed
@orta

Re: #378

A deploy of the TypeScript v2 website, which used a somewhat novel redirect structure, caused subsequent builds of the site to deploy incorrectly.

Effectively, some files on the site came from the beta (v2) version of the website, and others came from the older (v1) version.

This meant that many beta pages couldn't link to each other, and that some links didn't work at all (like the playground, which relies on a lot of other files).

Timeline

  • 3/14/2020 - v2 deployed for testing over the weekend
  • 3/14/2020 - PR adds redirects for missing URLs to TypeScript v2 site
  • 3/15/2020 - Deploy switching back to TypeScript v1
  • 3/16/2020 🔒 - Noticed that the site was still v2 but working fine; previous deploys had passed local build tests but had not been deployed to the servers
  • 3/17/2020
    • 12:00pm EST Tried deploying v2 from scratch to see if there was an issue with v1 code specifically
    • 12:00 EST First reports of pages with 404s start coming in
    • 1:00 EST Started internal thread about deploys not behaving as expected
    • 1:10 EST Tried deploying an older build of v1 to see if there was an issue with new merged PRs
    • 1:30 EST Tried deploying last week's working v2
    • 2:00 EST Tried deploying the v2 build from the week before that
    • 2:30 EST Tried deploying v1 without the v2 subfolder, on the chance there was something specific to v2
    • 3:30 EST Tested a deploy of v1 to staging, which worked
    • 4:00 EST Started asking internally within our org about forcing access to the Azure portal so we could read the deploy logs
    • 4:40 EST Got access to logs, diagnosed problem, started fix
    • 5:20 EST Full v1 deploy succeeded

Causes

We want to make sure that no links break in the transition from v1 to v2. There are a lot of v1 links which aren't in active use but still redirect to existing pages, so the v2 site has a section which looks like this:

export const veryOldRedirects = {
  Playground: "/play",
  Tutorial: "docs/home",
  Handbook: "docs/home",
  samples: "docs/home",
  "/docs/home.html": "/docs/home",
  "/playground": "/play",
}

export const handbookRedirects = {
  "/docs/handbook/writing-declaration-files": "/docs/handbook/declaration-files/introduction.html",
  "/docs/handbook/writing-declaration-files.html": "/docs/handbook/declaration-files/introduction.html",
  "/docs/handbook/writing-definition-files": "/docs/handbook/declaration-files/introduction.html",
  "/docs/handbook/typings-for-npm-packages": "/docs/handbook/declaration-files/publishing.html",
  "/docs/handbook/release-notes": "/docs/handbook/release-notes/overview",
  "/docs/tutorial.html": "/docs/handbook/release-notes/overview",
}

export const setupRedirects = addRedirects => {
  addRedirects(veryOldRedirects)
  addRedirects(handbookRedirects)
}

setupRedirects loops through the objects above it and tells Gatsby that these redirects exist. Each redirect is then emitted to the file system as an HTML file, for example /Playground/index.html, which forwards you to /play via client-side JavaScript using the plugin gatsby-plugin-client-side-redirect.
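To make the mapping concrete, here is a rough sketch of what an addRedirects-style helper could look like. It is not the site's actual code: addRedirectsWith, withLeadingSlash and emittedFile are hypothetical names, the leading-slash normalization is an assumption, and createRedirect is only a stand-in that borrows the fromPath / toPath option names from Gatsby's createRedirect action.

```typescript
// Hypothetical shape of a single redirect; the field names mirror the
// `fromPath` / `toPath` options of Gatsby's `createRedirect` action.
type Redirect = { fromPath: string; toPath: string }

// Assumption: keys such as "Playground" are normalized to "/Playground".
const withLeadingSlash = (p: string): string => (p.startsWith("/") ? p : "/" + p)

// Turns one redirect map into individual redirects and hands each one
// to `createRedirect` (a stand-in for the real Gatsby action).
const addRedirectsWith =
  (createRedirect: (r: Redirect) => void) =>
  (redirects: Record<string, string>): void => {
    for (const [from, to] of Object.entries(redirects)) {
      createRedirect({ fromPath: withLeadingSlash(from), toPath: to })
    }
  }

// Where a client-side redirect page for `fromPath` would be written,
// relative to the build output folder (illustrative, not the plugin's code).
const emittedFile = (fromPath: string): string =>
  withLeadingSlash(fromPath).replace(/\/$/, "") + "/index.html"

// Usage: collect the redirects instead of registering them with Gatsby.
const collected: Redirect[] = []
addRedirectsWith(r => collected.push(r))({ Playground: "/play", "/playground": "/play" })
const files = collected.map(r => emittedFile(r.fromPath))
```

With this sketch, a key that already ends in .html would also be written as a folder containing an index.html, which is the shape of output that matters for the rest of this write-up.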

During a deploy, CI pushes to either the SITE-PRODUCTION or the SITE-STAGING branch and then sends a webhook for Azure to pick up and deploy the static HTML from that branch (similar to how GitHub Pages works).

The deployment script in Azure failed when a file transitioned from a path like /docs/handbook/writing-declaration-files.html to instead be a folder with an index.html (/docs/handbook/writing-declaration-files.html/index.html).

This meant that the deploy had migrated files successfully, in alphabetical order, until it hit the path above, leaving half the site in v1 and half the site in v2.
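This kind of clash can be spotted before a deploy by comparing the file listing that is already live with the one about to go out. The following is a minimal sketch of such a check, not something the site currently runs: fileDirectoryCollisions is a hypothetical helper, and the two listings are made up for illustration from the path in the error log rather than taken from real deploy manifests.

```typescript
// Returns every path that is a plain file in one file listing but has to be
// a directory in the other (because the other listing has files beneath it).
function fileDirectoryCollisions(previous: string[], next: string[]): string[] {
  // All directory prefixes implied by a list of file paths.
  const dirsOf = (files: string[]): Set<string> => {
    const dirs = new Set<string>()
    for (const file of files) {
      const parts = file.split("/")
      for (let i = 1; i < parts.length; i++) dirs.add(parts.slice(0, i).join("/"))
    }
    return dirs
  }

  const previousDirs = dirsOf(previous)
  const nextDirs = dirsOf(next)
  const collisions = new Set<string>()
  for (const file of next) if (previousDirs.has(file)) collisions.add(file)
  for (const file of previous) if (nextDirs.has(file)) collisions.add(file)
  return Array.from(collisions).sort()
}

// Illustrative listings only; these are not the real deploy manifests.
const deployedBefore = ["index.html", "docs/handbook/writing-declaration-files.html/index.html"]
const deployingNow = ["index.html", "docs/handbook/writing-declaration-files.html"]
const collisions = fileDirectoryCollisions(deployedBefore, deployingNow)
```

A non-empty result would be a reason to stop the deploy, or to clean up the target folder first, instead of finding out halfway through a sync.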

Resolution

Effectively, we had been slow with setting up access to the Azure portal, which is where we would have been able to see build logs for deploys.

Deploys to the TypeScript site have been unpredictable on the Azure side for quite a while. Normally you can send another build down the pipeline and it fixes itself on the next run. This meant that when a bad deploy happened, the first few answers were simply "let's send another build across", which takes roughly 30 minutes (~15m in CI, then ~5-30m in Azure) to verify.

After a few rounds of "send a deploy" didn't work and gave the baffling result of a v1 index page alongside a v2 playground, it started to look like getting access to the build logs was going to be the only answer.

@DanielRosenwasser asked for some help from someone who had been helping the TS team set up our Azure portal access (Thanks Antoni) to see if we could speed it up.

Once we had access to the build logs, it became quite obvious what the issue was:

Error: The target file "D:\home\site\wwwroot\docs\handbook\writing-declaration-files.html" is a directory, not a file.

KuduSync.NET from: 'D:\home\site\repository' to: 'D:\home\site\wwwroot'
Copying file: 'docs\handbook\writing-declaration-files.html'
Failed exitCode=1, command="kudusync" -v 50  -f "D:\home\site\repository" -t "D:\home\site\wwwroot" -n "D:\home\site\deployments\d05a2ce67eeb43eb7b1efb61d5add7bca7afa673\manifest" -p "D:\home\site\deployments\871b18365dbe5f2e572dbb5043fdf3de61c3af69\manifest" -i ".git;.hg;.deployment;deploy.cmd"
An error has occurred during web site deployment.
Error: The target file "D:\home\site\wwwroot\docs\handbook\writing-declaration-files.html" is a directory, not a file.\r\nD:\Program Files (x86)\SiteExtensions\Kudu\85.11226.4297\bin\Scripts\starter.cmd "D:\home\site\deployments\tools\deploy.cmd"

From there, the files were deleted via the Azure console, and a triggered redeploy went through successfully, putting the whole site back on v1.

Post-Mortem

It took us 9 months to get other members of the team access to the Azure portal, which, ironically, was supposed to happen earlier this morning, but we had to reschedule (given how wild everything is with COVID-19).

I didn't push the deadlines hard enough because we weren't seeing any problems. Having portal access to change settings and see build logs is a "nice to have" when you think you're working with static site hosting. It turns out that the site isn't running on cloud storage but is an Azure App Service app, which meant we owned more of the hosting responsibilities than I had anticipated.

To my knowledge, this is the first downtime since I started working on the site. That sucks. Sorry, folks.

Mitigation

I have a few direct TODOs to stop this happening again:

  • Update the redirect plugin in v2 to not create the /index.html when the file is already a *.html
  • Set up deploys to go directly to Azure, instead of using the git integration
  • Add alerts to a Teams chat room when a deploy fails
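For the first TODO, the change could be as small as the sketch below. This is only an illustration of the idea, not the plugin's code: redirectOutputFile is a hypothetical helper that decides where a redirect page gets written.

```typescript
// Hypothetical helper: picks the output file for a client-side redirect page.
// A `fromPath` that already names an *.html file is written as that file;
// anything else gets the usual folder with an index.html inside.
function redirectOutputFile(fromPath: string): string {
  const trimmed = fromPath.replace(/\/+$/, "")
  return trimmed.endsWith(".html") ? trimmed : trimmed + "/index.html"
}

const htmlCase = redirectOutputFile("/docs/handbook/writing-declaration-files.html")
const folderCase = redirectOutputFile("/docs/handbook/writing-declaration-files")
```

Written this way, a redirect from an old *.html URL stays a plain file on disk, so it can't turn into a folder that a later deploy has to replace with a file.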

For the long term:

  • Look into moving the TypeScript website to a static site CDN (which I hope will be a perf boost too)
