At 3:39 am UTC, our engineering team was alerted to an elevated number of errors in our CDN. While looking into the increased error rate, we noticed SSL handshake errors between the caching layer and the workers forwarding requests to origins.
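As a rough illustration of the kind of check that pages on-call for an elevated error rate (this is not our actual monitoring stack; the class name, window size, and threshold below are hypothetical), a sliding-window error-rate monitor might look like:

```python
from collections import deque


class ErrorRateMonitor:
    """Tracks the last `window` request outcomes and flags when the
    error rate crosses `threshold` (e.g. 0.05 = 5%)."""

    def __init__(self, window: int = 1000, threshold: float = 0.05):
        self.outcomes = deque(maxlen=window)  # True = success, False = error
        self.threshold = threshold

    def record(self, ok: bool) -> None:
        self.outcomes.append(ok)

    def error_rate(self) -> float:
        if not self.outcomes:
            return 0.0
        return self.outcomes.count(False) / len(self.outcomes)

    def should_alert(self) -> bool:
        return self.error_rate() > self.threshold


monitor = ErrorRateMonitor(window=100, threshold=0.05)
for _ in range(90):
    monitor.record(True)
for _ in range(10):
    monitor.record(False)  # simulated SSL handshake failures

print(monitor.error_rate())    # 0.1
print(monitor.should_alert())  # True
```

In practice the same idea usually lives in a metrics/alerting system rather than application code, but the shape of the check is the same: a bounded window of outcomes and a threshold comparison.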
Additional debugging surfaced that the SSL handshake failures were caused by the graphcdn.app domain expiring. Later investigations revealed that while the domain was set to renew automatically, the credit card payment for the renewal had failed.

All requests sent to any *.graphcdn.app subdomain were served an error page by the domain registrar. Because the CDN workers were internally using a graphcdn.app subdomain (a leftover from our name change), the domain expiry caused all requests to fail, even if the external domain used was not a *.graphcdn.app subdomain but a *.stellate.sh subdomain or a custom domain.

Additionally, we could not immediately get ahold of the person required to restore access to the graphcdn.app domain, so the engineering team started to move the CDN workers to a different domain. That restored service for all *.stellate.sh subdomains and custom domains. We then regained access to and renewed the graphcdn.app domain, which resolved the issue for any services using graphcdn.app. (The time those services were available again varied slightly depending on DNS propagation.)
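An expired domain surfaces at the TLS layer as a failed handshake, because the certificate served (or the registrar's placeholder page) no longer matches what clients expect. A minimal sketch of the related preventive check, turning a certificate's `notAfter` timestamp into days-until-expiry with Python's standard library (the date strings below are hypothetical; in a real check you would take `notAfter` from `ssl.SSLSocket.getpeercert()`):

```python
import ssl


def days_until_expiry(not_after: str, now: float) -> float:
    """Convert a certificate's notAfter string (the format returned in
    the 'notAfter' field of ssl.getpeercert()) into days remaining
    relative to `now` (a Unix timestamp)."""
    expiry_ts = ssl.cert_time_to_seconds(not_after)
    return (expiry_ts - now) / 86400


# Hypothetical certificate expiring Jan 1 2030, checked on Jan 1 2029.
not_after = "Jan  1 00:00:00 2030 GMT"
now = ssl.cert_time_to_seconds("Jan  1 00:00:00 2029 GMT")
print(round(days_until_expiry(not_after, now)))  # 365
```

Alerting well before this number reaches zero (for both certificates and the domain registration itself) is the standard guard against this failure mode.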
The on-call engineers did not have access to the domain registrar where we registered the
graphcdn.app domain. We couldn’t get ahold of the person who had access to that registrar because they weren’t on-call.
While trying to find another way to reach that person, we deployed the first fix at 5:15 am UTC, which removed the internal dependency on the graphcdn.app domain and restored service for all custom and *.stellate.sh domains.
We recovered access to and renewed the graphcdn.app domain, and restored service for the *.graphcdn.app subdomains at 5:50 am UTC.
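The first fix amounted to making the workers' internal endpoint something that can be swapped without touching code. A generic sketch of that idea, reading the internal worker hostname from configuration with a fallback, so that a domain move is a config change rather than a deploy (the variable and host names are illustrative, not Stellate's actual code):

```python
import os

# Default internal endpoint; overridable via environment so a domain
# migration is a configuration change, not a code change.
DEFAULT_WORKER_HOST = "workers.internal.example.com"  # hypothetical


def worker_url(path: str) -> str:
    """Build the URL the caching layer uses to reach the worker fleet."""
    host = os.environ.get("WORKER_HOST", DEFAULT_WORKER_HOST)
    return f"https://{host}/{path.lstrip('/')}"


print(worker_url("/forward"))
```

Setting `WORKER_HOST=workers.new-domain.example` in the environment would then reroute internal traffic without redeploying, which is the property that makes an emergency domain move feasible under time pressure.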
After resolving the incident, we conducted an internal post-mortem, analyzed what happened, and derived immediate actions, which are already completed, as well as future improvements: