GitHub's four-surface outage exposes the shared-fate problem nobody talks about

What happened

On the morning of the incident, GitHub's status page (githubstatus.com/incidents/xy1tt3hs572m) flipped four product surfaces to degraded at effectively the same moment: Pull Requests, Issues, Git Operations, and API Requests. The Hacker News thread climbed past 188 points within the hour, dominated by the usual genre of comment — engineers watching their CI queue stall, bots flooding retry logs, and Dependabot PRs piling into a void.

The incident itself was relatively short by GitHub's recent standards, but the surface area was the story. Four 'independent' product areas going red within the same minute is not four bugs — it's one bug wearing four hats. Git Operations covers the raw `git push`/`git fetch` path over HTTPS and SSH. Pull Requests is the merge-state machine on top of it. Issues is a separate-ish product surface backed by the same primary datastore. API Requests is the REST and GraphQL gateway that fronts all of the above. When all four degrade together, you are not looking at coincidence; you are looking at a shared dependency — usually the primary MySQL cluster, the auth service, or the rate-limiter fronting them.

GitHub's post-incident summaries over the last 18 months have repeatedly fingered the same suspects: replication lag on the Vitess-sharded MySQL fleet, a misbehaving auth path, or a control-plane change that cascaded through Spokes (their internal Git serving layer). We don't yet have the postmortem for this one. We do have the pattern.

Why it matters

The interesting question isn't "why did GitHub break." Everything breaks. The interesting question is why your build pipeline assumes it won't.

Walk through a typical mid-sized engineering org during one of these brownouts. The CI runner can't `git fetch`, so jobs hang until the 30-minute timeout, then retry, then hang again. Many CI setups retry aggressively on git failures by default, which means a GitHub brownout doesn't just stall your pipeline; it actively makes GitHub's recovery harder by hammering it with retry storms from millions of runners simultaneously. A real example: during the January 2023 incident, GitHub's own status updates noted that retry traffic from CI systems extended the recovery window by an estimated 40 minutes after the underlying issue was resolved. The thundering herd is real, and you are part of it.

Meanwhile, your reviewers can't load PR diffs (Pull Requests surface down), your on-call can't file an incident in your GitHub-Issues-backed runbook (Issues down), your Slack bot can't comment on the PR to notify the author (API down), and your deploy tooling can't tag a release (Git Ops down). The four surfaces look independent in the marketing copy and on the status page, but they collapse together into a single point of failure the moment something upstream of all of them blinks.

This is the shared-fate problem, and it's structural. GitHub can't fully decouple these surfaces without rebuilding the product. Pull Requests *fundamentally needs* Git Ops to compute mergeability. The API *fundamentally needs* the primary datastore to serve writes. Issues *shares* notification infrastructure with PRs. The dependency graph is not a bug — it's the architecture. The honest read is: GitHub will continue to have multi-surface outages roughly quarterly, because the only alternative is rewriting the product, and they're not going to.
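To make the shared-fate idea concrete, here is a minimal sketch that treats the surfaces as nodes in a dependency graph and asks which dependencies all four have in common. The edges below are a hypothetical illustration loosely based on the relationships described above; the node names `notifications`, `auth`, and `primary datastore` and the exact edges are assumptions, not GitHub's real architecture.

```python
# Hypothetical dependency edges for illustration only; not GitHub's real topology.
DEPENDS_ON = {
    "Pull Requests": {"Git Operations", "notifications"},
    "Issues": {"notifications", "primary datastore"},
    "Git Operations": {"auth", "primary datastore"},
    "API Requests": {"auth", "primary datastore"},
    "notifications": {"primary datastore"},
    "auth": set(),
    "primary datastore": set(),
}

SURFACES = ["Pull Requests", "Issues", "Git Operations", "API Requests"]


def transitive_deps(node, graph):
    """Return everything `node` depends on, directly or indirectly."""
    seen = set()
    stack = list(graph.get(node, ()))
    while stack:
        dep = stack.pop()
        if dep not in seen:
            seen.add(dep)
            stack.extend(graph.get(dep, ()))
    return seen


def shared_fate(surfaces, graph):
    """Return the dependencies common to every surface: the shared failure domain."""
    return set.intersection(*(transitive_deps(s, graph) for s in surfaces))


if __name__ == "__main__":
    print(shared_fate(SURFACES, DEPENDS_ON))
```

In this toy graph the only dependency common to all four surfaces is the primary datastore, so a fault there would turn every surface red at once. Running the same exercise against your own service map is a cheap way to find out which vendor or component plays that role for you.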

Compare this to the discipline you'd apply to your own infra. If your service had four critical subsystems and they all failed together every 90 days, you'd have a SEV-1 architectural review and a multi-quarter project to decompose the failure domains. GitHub's scale and product complexity make that calculus different, and the upshot for you is that you are running production on top of a vendor whose blast radius is wider than you've been pretending.

What this means for your stack

A few concrete moves, in rough order of leverage:

1. Mirror your critical repos. Set up a read-only mirror on a second Git host — GitLab, Gitea, Codeberg, an S3-backed `git bundle`, whatever. Cron it every 15 minutes. The total cost is under an hour of setup and roughly $0/month; the upside is that a two-hour GitHub outage no longer blocks a hotfix deploy. Your CI can fall back to the mirror with a one-line URL swap.

2. Cap your CI retry behavior. Audit every job that runs `git fetch`, calls the `gh` CLI, or hits `api.github.com`. Add exponential backoff with jitter, and cap total retry attempts at 3. Many teams have GitHub Actions workflows or Jenkins pipelines that will retry indefinitely on network failure. These are precisely the jobs that turn a 20-minute GitHub blip into a 90-minute outage for your team and a thundering herd for GitHub.

3. Decouple your runbooks from GitHub Issues. If your incident response process starts with "open an issue in the runbook repo," you have a circular dependency the moment GitHub goes down. Move your incident commander checklist to a static page, a Notion doc, or a printed PDF — anything that survives a GitHub outage.

4. Don't use GitHub as your status communication channel. If your status page is a GitHub Pages site, your customers can't reach it when GitHub is down. This is more common than you'd think.
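Moves 1 and 2 can be sketched together as a small fetch wrapper: try the primary remote a capped number of times with jittered exponential backoff, then fall back to the mirror. This is a rough sketch under stated assumptions, not a drop-in CI step. It assumes the mirror has already been added as a second Git remote named `mirror` (for example with `git remote add mirror <url>`), and the function names, the 2-second base delay, the 60-second cap, and the 120-second timeout are all illustrative choices. It also reads "cap at 3" as three attempts in total per remote.

```python
import random
import subprocess
import time


def git_fetch(remote):
    """Return True if `git fetch <remote>` succeeds within 120 seconds."""
    try:
        result = subprocess.run(
            ["git", "fetch", remote], capture_output=True, timeout=120
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


def fetch_with_fallback(
    remotes=("origin", "mirror"),  # assumed remote names; "mirror" is hypothetical
    fetch=git_fetch,
    sleep=time.sleep,
    max_attempts=3,
    base=2.0,
    cap=60.0,
):
    """Try each remote in order and return the first one that works, or None."""
    for remote in remotes:
        for attempt in range(max_attempts):
            if fetch(remote):
                return remote
            if attempt < max_attempts - 1:
                # Full jitter: sleep a random time within an exponentially growing window.
                sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
    return None


if __name__ == "__main__":
    used = fetch_with_fallback()
    print(f"fetched from {used}" if used else "all remotes failed; giving up")
```

The jitter matters as much as the cap: if every runner waits exactly 2, then 4 seconds, the retries still arrive in synchronized waves. Randomizing the delay spreads them out, and giving up after a fixed number of attempts keeps a stalled job from joining the herd indefinitely.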

Looking ahead

GitHub's reliability profile is, on average, very good, and better than most teams could build themselves. But "on average" hides the shape of the failure mode, which is infrequent, correlated, multi-surface, and hours-long. That failure shape is poorly matched to the way most engineering teams have wired GitHub into their critical path, treating it as an always-on utility rather than a vendor with a quarterly bad day. The fix isn't to leave GitHub; the alternatives have their own incidents and you'd lose the network effects. The fix is to stop pretending the four-surface outage is unusual, and start designing for it the same way you design for an AWS region going dark. Treat the next red square on GitHub's status page as a fire drill you've already rehearsed, not a surprise.

// read from source

Incident with Pull Requests, Issues, Git Operations and API Requests
