Increased latency for some edge locations

Incident Report for Stellate

Postmortem

Leadup/fault

We observed a DDoS attack targeting one of our customers. The attack traffic, at a volume orders of magnitude higher than usual, exhausted the maximum limit on concurrent requests per Point of Presence imposed by our service provider. Mitigation was delayed because requests were being dropped at this rate.

Impact

During this attack, traffic exceeded a location-wide limit on concurrent connections. As a result, all traffic at the affected locations was degraded, and the attack impacted more than the targeted customer. All of the services we host remained online and reachable during this period but experienced increased latency due to throttling.
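
To illustrate why a location-wide limit lets an attack on one service affect every other service at that location, here is a minimal toy model. It is not our provider's implementation: the class name, the limit of 100, and the service names are all made up for the example, and "throttled" here simply means the request was not admitted immediately.

```python
from collections import Counter


class PointOfPresence:
    """Toy model of one edge location with a shared cap on in-flight requests."""

    def __init__(self, max_concurrent: int) -> None:
        self.max_concurrent = max_concurrent  # hypothetical location-wide limit
        self.in_flight = 0
        self.throttled = Counter()  # service name -> requests not admitted

    def try_accept(self, service: str) -> bool:
        # The cap is location-wide: it does not consider which service
        # the request belongs to.
        if self.in_flight >= self.max_concurrent:
            self.throttled[service] += 1
            return False
        self.in_flight += 1
        return True

    def finish(self) -> None:
        # Called when a request completes and frees its slot.
        self.in_flight = max(0, self.in_flight - 1)


pop = PointOfPresence(max_concurrent=100)

# A flood aimed at one service occupies every available slot...
for _ in range(10_000):
    pop.try_accept("attacked-service")

# ...so a request for an unrelated service at the same location is throttled too.
accepted = pop.try_accept("unrelated-service")
```

In this model the unrelated service is turned away even though it sent a single request, which is the shared-limit effect described above.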

Timeline (all times in UTC)

On 2022/04/28:
Around 10 am, we observed an increase in traffic.
Around 10:18 am, we observed this traffic impacting service performance. This was confirmed by customer reports.
At 10:23 am, we declared an incident and started our investigation and remediation process.
Around 10:35 am, we identified a DDoS attack targeting a specific customer as the root cause.
At 11:05 am, we confirmed a remediation plan with the affected customer and blocked traffic to their service.
At 11:07 am, latency across our other services returned to expected levels.
At 11:32 am, the affected customer reduced the traffic routed to our CDN and continued to work with us to bring their service back up.

💡 The above diagram shows the traffic pattern we observed, with a horizontal marker showing the mean number of requests we’d typically expect to see.

Short-term solution

After informing the customer of the issue, we worked with them to temporarily stop routing traffic to our CDN, to reduce the amount of traffic entering the affected Points of Presence.
We have shipped a per-service kill switch which allows us to block traffic to customer services if a customer says they’re unable to cope with a sudden influx of requests, as observed in DoS attacks.
We have talked to our infrastructure provider and raised our concurrent connection limits.
We learned from the latency and traffic patterns specific to this DDoS attack and are improving our monitoring to detect such patterns sooner.
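
As a rough illustration of two of the items above, the following sketch pairs a per-service kill switch with a simple check that compares current traffic against a recent mean. It is a sketch under stated assumptions, not our production code: the `KillSwitch` class, the `is_spike` function, the factor of 10, and the sample numbers are all hypothetical.

```python
from statistics import mean
from typing import Sequence


class KillSwitch:
    """Hypothetical per-service block list consulted before forwarding a request."""

    def __init__(self) -> None:
        self._blocked = set()

    def block(self, service: str) -> None:
        self._blocked.add(service)

    def unblock(self, service: str) -> None:
        self._blocked.discard(service)

    def allows(self, service: str) -> bool:
        return service not in self._blocked


def is_spike(recent_counts: Sequence[int], current: int, factor: float = 10.0) -> bool:
    """Flag the current interval if it exceeds `factor` times the recent mean."""
    if not recent_counts:
        return False  # no baseline yet, so nothing to compare against
    return current > factor * mean(recent_counts)


switch = KillSwitch()
baseline = [980, 1010, 1005, 995]  # made-up request counts per interval

if is_spike(baseline, current=150_000):
    # During the incident, blocking was agreed with the customer first;
    # this sketch only shows the mechanism.
    switch.block("attacked-service")
```

A fixed multiple of the mean is the simplest possible detector; a real monitor would likely also consider latency and per-location breakdowns.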

Future plans

We are prioritizing the ability for services to restrict the kinds of requests they accept (e.g. rejecting non-GraphQL requests), so that more traffic can be blocked at the edge.
We are prioritizing implementing configurable rate limiting.
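
To make these two plans more concrete, here is a minimal sketch of what such checks could look like. Both pieces are illustrative assumptions rather than the planned implementation: `looks_like_graphql` is a deliberately rough heuristic that only recognizes JSON POST bodies with a string `query` field (GraphQL requests sent via GET, or batched requests, would need extra handling), and `TokenBucket` is one common rate-limiting technique with hypothetical `rate` and `capacity` settings.

```python
import json
import time
from typing import Optional


def looks_like_graphql(method: str, content_type: str, body: bytes) -> bool:
    """Rough check: a POST with a JSON object body carrying a string `query` field."""
    if method.upper() != "POST":
        return False
    if not content_type.lower().startswith("application/json"):
        return False
    try:
        payload = json.loads(body)
    except ValueError:  # covers invalid JSON and undecodable bytes
        return False
    return isinstance(payload, dict) and isinstance(payload.get("query"), str)


class TokenBucket:
    """Token bucket: allows bursts up to `capacity`, refilled at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: float, now: Optional[float] = None) -> None:
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity  # start full so an initial burst is allowed
        self.updated = time.monotonic() if now is None else now

    def allow(self, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.updated)
        # Refill for the time elapsed, never exceeding the configured capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# Usage: reject anything that is not GraphQL, then apply the service's limit.
bucket = TokenBucket(rate=50.0, capacity=100.0)
request_ok = (
    looks_like_graphql("POST", "application/json", b'{"query": "{ me { id } }"}')
    and bucket.allow()
)
```

Running the cheap request-kind check first means that traffic which is clearly not GraphQL never consumes rate-limit tokens.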

Posted May 03, 2022 - 10:23 UTC

Resolved

This incident has been resolved.

Posted Apr 28, 2022 - 13:46 UTC

Monitoring

Service metrics are back to regular levels. We are monitoring our systems closely and will post an update with a proper post mortem later as well.

Posted Apr 28, 2022 - 11:16 UTC

Identified

We have identified and deployed a fix and are monitoring performance.

Posted Apr 28, 2022 - 11:09 UTC

Investigating

We are looking into an issue with increased latency in some of our locations.

Posted Apr 28, 2022 - 10:29 UTC