Issues with configuration updates propagating

Incident Report for Stellate

Postmortem

Stellate relies on Fastly infrastructure for our offerings.
Fastly experienced a partial outage of their KV Store offering on August 17th and August 18th, which affected Stellate. They provide a summary of this incident on their status page at https://www.fastlystatus.com/incident/376022

Timeline

August 17th 10:46 UTC - A customer reported their Stellate endpoint failing in the FRA (Frankfurt) point of presence (POP), as well as in several other edge locations. This was triggered by a configuration update they had pushed, specifically to the originUrl.
10:50 - We identified the issue as being a stale KV value in the FRA POP, as well as several others.
10:55 - We created an incident on our status page for degraded KV in the FRA POP and several others.
13:08 - We realized that Rate Limiting and Developer Portals were affected by this outage as well.
13:30 - We reported this incident to Fastly.
August 18th 4:00 UTC - Fastly had not yet been able to provide us with a satisfactory explanation of what was causing this and had not acknowledged the ongoing outage.
6:23 - A large e-commerce customer reported their website was unavailable. This was due to a KV key disappearing in the FRA POP, as well as several others.
7:09 - Additional reports started to come in via Intercom about services not responding properly.
7:15 - We escalated the incident with Fastly because, from our view, more regions appeared to be affected and were becoming unavailable.
7:16 - We deployed a partial fix that disabled our new infrastructure. This fixed edge caching for users who didn’t recently push configuration changes (the majority of services). Rate Limiting, JWT-based scopes, and the Developer Portal were still affected by the KV outage.
8:01 - Fastly was able to reproduce the bug based on a reproduction that we provided earlier and started working on a fix.
9:02 - Fastly opened an official incident on their status page.
10:04 - Fastly marked the incident as resolved.
10:19 - Fastly communicated to us that the cause was an issue with surrogate keys in their C@E caching layer.
August 22nd - Fastly shared their confidential Fastly Service Advisory with us, which provides additional information about this incident and how they want to prevent it from happening again.

Next Steps

We have had several calls with Fastly over the last couple of days, working with them to analyze what went wrong, why it took them so long to escalate this internally, and how we can improve communication and collaboration going forward.
- As a direct outcome of this, we have re-connected with our European contacts at Fastly and designated a direct contact to involve in conversations and escalations going forward.
We are going to investigate a fallback option for Fastly KV.
Additionally, we will review all possible failure points that could make Stellate core services inaccessible (in the event of a third-party outage) and investigate options for additional redundancies for those services.
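
To illustrate what a fallback for KV reads could look like, here is a minimal sketch. It is not Stellate's or Fastly's actual implementation: the `ConfigReader` class, the `primary_get` callable, and the in-memory last-known-good copy are all hypothetical names chosen for this example. The idea is simply to keep the last value that was read successfully and serve it when the primary store raises an error or a key disappears.

```python
from typing import Callable, Dict, Optional


class ConfigReader:
    """Hypothetical reader: primary KV lookup with a last-known-good fallback."""

    def __init__(self, primary_get: Callable[[str], Optional[str]]) -> None:
        # primary_get stands in for a KV store lookup; it returns None on a miss.
        self._primary_get = primary_get
        # Assumption: an in-memory copy is good enough for the illustration.
        self._last_known_good: Dict[str, str] = {}

    def get(self, key: str) -> Optional[str]:
        try:
            value = self._primary_get(key)
        except Exception:
            # Treat a store error the same way as a missing key.
            value = None
        if value is not None:
            # Remember every successful read so it can be served later.
            self._last_known_good[key] = value
            return value
        # Primary failed or the key vanished: serve the last value we saw, if any.
        return self._last_known_good.get(key)


# Usage with a plain dict standing in for the KV store.
store = {"service-a": '{"originUrl": "https://origin.example.com"}'}
reader = ConfigReader(store.get)
first_read = reader.get("service-a")
del store["service-a"]  # simulate the key disappearing
second_read = reader.get("service-a")
```

Note that this sketch only covers missing keys and store errors. It would not detect a stale value like the one observed at 10:50, which would require some form of versioning on the stored configuration.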

Posted Sep 11, 2023 - 10:53 UTC

Resolved

This issue has been resolved. We have temporarily switched all services back to our "old infrastructure" and are running additional tests as well as working with Fastly before we reopen the "new infrastructure".

We will also publish additional details once we conclude our internal post mortem process.

Posted Aug 18, 2023 - 12:17 UTC

Monitoring

Fastly has implemented a fix for the issue, and all services are working as expected again.

We have temporarily disabled switching over to the new infrastructure and are working with Fastly to better understand what happened on their end, why it took so long to identify and rectify this, and how we can better monitor and prevent this in the future. We will enable the new infrastructure again once we are confident in the services we rely on.

Posted Aug 18, 2023 - 10:33 UTC

Update

We continue working with Fastly to resolve this issue. Please see https://www.fastlystatus.com/incident/376022 for updates from their team as well.

Posted Aug 18, 2023 - 09:53 UTC

Update

We are continuing to work on a fix for this issue.

Posted Aug 18, 2023 - 07:29 UTC

Update

The incident with KV stores, which are used for service configuration, is now spreading to additional edge locations and affecting overall service availability for services on the new infrastructure. We have disabled the new infrastructure to provide our partner more time to identify and resolve the issue on their end.

Posted Aug 18, 2023 - 07:15 UTC

Identified

Our infrastructure partner has identified the issue and is working on fixing it.

Posted Aug 17, 2023 - 16:03 UTC

Update

We are continuing to investigate this issue together with our infrastructure providers.

If you haven't made configuration changes to your service recently, you are not affected by this issue.

Posted Aug 17, 2023 - 13:56 UTC

Investigating

We are investigating an issue with configuration updates propagating to the respective services. If you didn't make configuration changes recently, your services are not impacted by this incident.

Posted Aug 17, 2023 - 11:55 UTC

This incident affected: GraphQL Edge Caching and GraphQL Rate Limiting.