At 3:39 am UTC, our engineering team was alerted to an elevated number of errors in our CDN. While looking into the increased error rate, we noticed SSL handshake errors between the caching layer and the workers forwarding requests to origins.
Additional debugging surfaced that the SSL handshake failures were caused by the graphcdn.app domain expiring. Later investigations revealed that while the domain was set to renew automatically, the credit card payment for the renewal failed. Additionally, we could not immediately get ahold of the person required to restore access to the graphcdn.app domain.
All requests sent to any *.graphcdn.app subdomain were served an error page by the domain registrar. Because the CDN workers were internally using a graphcdn.app subdomain (a leftover from our name change), this domain expiry caused all requests to fail, even if the external domain used was not a *.graphcdn.app subdomain but a *.stellate.sh subdomain or a custom domain.
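To illustrate the failure mode, here is a minimal sketch of a forwarding worker with that kind of internal hostname dependency. The hostname origin-proxy.graphcdn.app and the handler itself are hypothetical, not our actual worker code; the point is only that the upstream host is fixed regardless of which external domain the request arrived on.

```typescript
// Hypothetical sketch of a caching-layer worker (Workers-style runtime, where
// Request, Response, and fetch are globals). Hostnames are made up.
const INTERNAL_ORIGIN_PROXY = "origin-proxy.graphcdn.app";

export async function handleRequest(request: Request): Promise<Response> {
  // On a cache miss, the request is always forwarded to the internal proxy
  // hostname, no matter whether it arrived on *.stellate.sh, *.graphcdn.app,
  // or a custom domain.
  const incoming = new URL(request.url);
  const upstream = new URL(
    incoming.pathname + incoming.search,
    `https://${INTERNAL_ORIGIN_PROXY}`
  );

  // When graphcdn.app expired, requests to this hostname hit the registrar's
  // error page instead of our workers, and the TLS handshake here failed for
  // every request.
  return fetch(upstream.toString(), {
    method: request.method,
    headers: request.headers,
    body:
      request.method === "GET" || request.method === "HEAD"
        ? undefined
        : request.body,
  });
}
```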
While working to regain access to the graphcdn.app domain, the engineering team started to move the CDN workers to a different domain. This restored service for all requests sent to *.stellate.sh subdomains or custom domains. We then regained access to and renewed the graphcdn.app domain, which resolved the issue for any services using graphcdn.app. (The time those services were available again varied slightly depending on DNS propagation.)
The on-call engineers did not have access to the domain registrar where we registered the graphcdn.app domain. We couldn’t get ahold of the person who had access to that registrar because they weren’t on-call.
While trying to find another way to reach that person, we deployed the first fix at 5:15 am UTC that removed the internal dependency on the graphcdn.app domain to restore service for all custom and *.stellate.sh subdomains.
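As a sanity check on that kind of change, a handshake test like the following can confirm that the TLS handshake toward the new internal hostname succeeds before traffic is cut over. The hostname and script are hypothetical, not our actual tooling.

```typescript
// check-internal-tls.ts — sketch of verifying the TLS handshake against the
// new internal forwarding hostname. The default hostname is hypothetical.
import * as tls from "node:tls";

const host = process.argv[2] ?? "origin-proxy.stellate.sh";

const socket = tls.connect({ host, port: 443, servername: host }, () => {
  // This callback only fires after a successful TLS handshake.
  const cert = socket.getPeerCertificate();
  console.log(`TLS handshake with ${host} succeeded`);
  console.log(`certificate CN: ${cert.subject?.CN}, valid until: ${cert.valid_to}`);
  socket.end();
});

socket.on("error", (err) => {
  console.error(`TLS handshake with ${host} failed: ${err.message}`);
  process.exit(1);
});
```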
We recovered access to and renewed the graphcdn.app domain, and restored service for the *.graphcdn.app subdomains at 5:50 am UTC.
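Because the recovery time for individual services depended on DNS propagation, a spot check like the following can show when a record has reached the major public resolvers again. The subdomain below is hypothetical and the resolver list is just an example.

```typescript
// dns-propagation-check.ts — sketch of checking whether a *.graphcdn.app
// record has propagated to a few public resolvers. Subdomain is hypothetical.
import { Resolver } from "node:dns/promises";

const HOSTNAME = process.argv[2] ?? "example.graphcdn.app";
const RESOLVERS: Record<string, string> = {
  cloudflare: "1.1.1.1",
  google: "8.8.8.8",
  quad9: "9.9.9.9",
};

for (const [name, ip] of Object.entries(RESOLVERS)) {
  const resolver = new Resolver();
  resolver.setServers([ip]);
  try {
    const addresses = await resolver.resolve4(HOSTNAME);
    console.log(`${name} (${ip}): ${HOSTNAME} -> ${addresses.join(", ")}`);
  } catch (err) {
    console.log(`${name} (${ip}): not resolving yet (${(err as Error).message})`);
  }
}
```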
After resolving the incident, we held an internal post-mortem, analyzed what went wrong, and derived a set of immediate actions, which have already been completed, as well as future improvements: