Cloudflare deployed a change to its global network that took its 19 busiest locations offline (together accounting for about 50% of the total traffic passing through Cloudflare). This outage propagated to the Stellate GraphQL Edge Cache, which uses Cloudflare Workers under the hood.
Traffic passing through Stellate POPs (provided by Fastly) that was routed to affected Cloudflare locations saw increased error rates and outages. This affected all Stellate services, regardless of whether GraphQL Edge Caching was enabled.
Since we use our GraphQL Analytics service for internal APIs, our dashboard was affected by the outage as well.
The Stellate Purging API also runs on Cloudflare Workers and was unavailable in affected locations.
Lastly, we observed failed attempts for users trying to log in to the dashboard via email. Our endpoint errored due to the WorkOS API (used internally to power magic login links) returning an error. WorkOS also mentioned a “degraded service” incident on their status page that aligns with the timing of the Cloudflare outage.
Timeline (all times in UTC)
On 2022-06-21, around 6:40 am we started getting customer reports about our CDN service being unavailable
Around 6:52 am we linked this to the Cloudflare incident
At 7:03 am an incident was opened at Stellate for a failing part of our internal system
Around 7:20 am Cloudflare implemented a fix; in the minutes after that we saw our services return to normal
Short-term solution
We improved our internal monitoring to check more locations. This will help us spot partial outages of our CDN services more quickly in the future.
We made the email login endpoint more resilient to outages of WorkOS.
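To illustrate why checking more locations helps, here is a minimal sketch of how per-location probe results could be combined to tell a partial outage apart from a healthy or largely failing network. It is not our actual monitoring code: the `ProbeResult` and `NetworkStatus` types, the `classify` function, the 50% threshold, and the location names are all assumptions made for this example.

```typescript
// Hypothetical types and threshold, for illustration only.
type ProbeResult = { location: string; ok: boolean };
type NetworkStatus = "healthy" | "partial_outage" | "major_outage";

// Combine per-location probe results into one overall status.
// A single failing location is enough to report a partial outage,
// which a probe from only one location could easily miss.
function classify(
  results: ProbeResult[],
  majorThreshold = 0.5, // assumed share of failing locations that counts as major
): { status: NetworkStatus; failing: string[] } {
  if (results.length === 0) {
    throw new Error("no probe results to classify");
  }
  const failing = results.filter((r) => !r.ok).map((r) => r.location);
  if (failing.length === 0) {
    return { status: "healthy", failing };
  }
  const failingShare = failing.length / results.length;
  const status: NetworkStatus =
    failingShare >= majorThreshold ? "major_outage" : "partial_outage";
  return { status, failing };
}

// Example with made-up location names: one of four probes fails.
const report = classify([
  { location: "location-1", ok: false },
  { location: "location-2", ok: true },
  { location: "location-3", ok: true },
  { location: "location-4", ok: true },
]);
console.log(report.status, report.failing);
```

With probes from only `location-2`, the same situation would look healthy, which is the gap that checking more locations is meant to close.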
Future plans
Even before today's incident, we were planning to consolidate our CDN service and reduce our dependencies on third-party providers like Cloudflare.
Posted Jun 21, 2022 - 15:49 UTC
Resolved
- Around 6:40 am we started getting customer reports about our CDN service being unavailable
- Around 6:52 am we linked this to the Cloudflare incident
- At 7:03 am an incident was opened at Stellate for a failing part of our internal system
- Around 7:20 am Cloudflare implemented a fix; in the minutes after that we saw our services return to normal