Major Service Outage

Incident Report for Stellate

Postmortem

Leadup/Fault

At 3:39 am UTC, our engineering team was alerted to an elevated number of errors in our CDN. While investigating the increased error rate, we noticed SSL handshake errors between the caching layer and the workers forwarding requests to origins.

Further debugging showed that the SSL handshake failures were caused by the graphcdn.app domain expiring. Later investigation revealed that although the domain was set to renew automatically, the credit card payment for the renewal had failed. In addition, we could not immediately reach the person required to restore access to the graphcdn.app domain.

All requests sent to any *.graphcdn.app subdomain were served an error page by the domain registrar. Because the CDN workers were internally using a graphcdn.app subdomain (a leftover from our name change), this domain expiry caused all requests to fail, even if the external domain used was not a *.graphcdn.app subdomain but a *.stellate.sh subdomain or a custom domain.
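
A hidden dependency like this can be searched for mechanically. The following is a minimal sketch, assuming the internal endpoints are available as a simple name-to-URL mapping; the endpoint names and URLs are hypothetical and do not reflect our actual configuration.

```python
from urllib.parse import urlsplit


def depends_on_domain(url: str, domain: str) -> bool:
    """Return True if the URL's host is `domain` itself or one of its subdomains."""
    host = (urlsplit(url).hostname or "").lower().rstrip(".")
    domain = domain.lower().rstrip(".")
    # Compare on a label boundary so "notgraphcdn.app" does not match "graphcdn.app".
    return host == domain or host.endswith("." + domain)


def find_legacy_dependencies(endpoints: dict[str, str], legacy_domain: str) -> list[str]:
    """List the names of endpoints whose URL still points at the legacy domain."""
    return sorted(
        name for name, url in endpoints.items() if depends_on_domain(url, legacy_domain)
    )


# Hypothetical internal endpoint map, for illustration only.
EXAMPLE_ENDPOINTS = {
    "origin_forwarder": "https://workers.internal.graphcdn.app/forward",
    "dashboard": "https://stellate.co/app",
    "lookalike": "https://notgraphcdn.app/",
}
```

Running `find_legacy_dependencies(EXAMPLE_ENDPOINTS, "graphcdn.app")` on this example map reports only `origin_forwarder`, the one entry that would break if the legacy domain stopped resolving.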

Timeline (all times in UTC)

3:39 am - The first alert was triggered, and the engineers on-call were paged and started investigating the issue.
4:01 am - While working through our incident runbook, the engineering team noticed an error message regarding a “moved domain.”
4:12 am - As we couldn’t immediately reach the person required to renew the graphcdn.app domain, the engineering team started to move the CDN workers to a different domain.
4:28 am - We opened an issue on our status page, https://status.stellate.co/incidents/m7v0bgflsg4c
5:15 am - We deployed the fix that moved the CDN workers to a different domain and restored service for all requests sent to *.stellate.sh subdomains or custom domains.
5:50 am - Restored access to the graphcdn.app domain, which resolved the issue for any services using graphcdn.app. (The time at which those services became available again varied slightly depending on DNS propagation.)
6:34 am - Marked the incident resolved.

How did we resolve it?

The on-call engineers did not have access to the domain registrar where we registered the graphcdn.app domain. We couldn’t get ahold of the person who had access to that registrar because they weren’t on-call.

While trying to find another way to reach that person, we deployed a first fix at 5:15 am UTC. It removed the internal dependency on the graphcdn.app domain and restored service for all custom domains and *.stellate.sh subdomains.

We recovered access to and renewed the graphcdn.app domain, and restored service for the *.graphcdn.app subdomains at 5:50 am.

Post mortem

After resolving the incident, we conducted an internal post-mortem, analyzed the incident, and derived a set of immediate actions, which are already completed, as well as future actions:

Immediate Actions

Validate that no other domains are expiring soon.
Audit and ensure all on-call engineers have access to all critical services.
Use a central email for authentication with any domain registrars.
Ensure there is an escalation policy to the founders and that the founders are permanently on-call.

Future Actions

Set up monitoring & alerts for expiring certificates & domains and audit our current monitoring setup for holes.
Audit and ensure all on-call engineers have access to all services (not just the critical ones).
Create a Customer Success on-call rotation, including guidelines on when and how to involve the Customer Success teams in ongoing incidents.
Set up monitoring and alerts for failed subscription payments.
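
As an illustration of the certificate half of the expiry monitoring listed above, here is a minimal sketch using only the Python standard library. The function names and the 30-day threshold are assumptions made for this example, not a description of our actual monitoring. Domain registration expiry is a separate check (for example via the registrar's own tooling) and is not covered here.

```python
import socket
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Days between `now` and a certificate's notAfter value.

    `not_after` uses the format returned by getpeercert(),
    e.g. "Jan  5 09:34:43 2018 GMT".
    """
    expires = datetime.fromtimestamp(ssl.cert_time_to_seconds(not_after), tz=timezone.utc)
    return (expires - now).total_seconds() / 86400


def fetch_not_after(hostname: str, port: int = 443, timeout: float = 5.0) -> str:
    """Connect to the host over TLS and return its certificate's notAfter string."""
    context = ssl.create_default_context()
    with socket.create_connection((hostname, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=hostname) as tls:
            return tls.getpeercert()["notAfter"]


def hosts_needing_attention(
    expiries: dict[str, str], now: datetime, threshold_days: float = 30
) -> list[str]:
    """Return hosts whose certificate expires in fewer than `threshold_days` days."""
    return sorted(
        host
        for host, not_after in expiries.items()
        if days_until_expiry(not_after, now) < threshold_days
    )
```

A scheduled job could call `fetch_not_after` for each monitored hostname, pass the collected values to `hosts_needing_attention` with `datetime.now(timezone.utc)`, and page the on-call engineer when the returned list is not empty.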

Posted Sep 26, 2022 - 19:15 UTC

Resolved

This incident has been resolved and all services are working as expected again.

Later today (European time zones), we will publish additional information on what triggered this incident, the steps taken to fix it, the issues we identified with our processes, and how we plan to address them.

Posted Sep 26, 2022 - 06:34 UTC

Update

Services on `graphcdn.app` domains are working again, though we still see some issues with DNS resolution for those domains from some providers.

Posted Sep 26, 2022 - 06:07 UTC

Update

We have deployed a fix for `graphcdn.app` domains and are waiting for the required DNS changes to propagate. All services should be working again shortly.

Posted Sep 26, 2022 - 05:55 UTC

Monitoring

All services using stellate.sh or custom domains are back up and running. Services using the older graphcdn.app domains are still affected.

Posted Sep 26, 2022 - 05:16 UTC

Update

We are continuing to investigate this issue, which is causing all Stellate services to be unavailable at this time.

Posted Sep 26, 2022 - 04:46 UTC

Investigating

We are currently investigating an issue with our edge caching service.

Posted Sep 26, 2022 - 04:28 UTC

This incident affected: Dashboard, User API, and Admin API.