Service disruption for Automatic Persisted Queries (APQs)
Incident Report for Stellate
Postmortem

Incident

  • A bug was released on Jan 8th at 1.43 pm UTC as part of a change that improved Persisted Operation support. The Persisted Operation and APQ code paths overlap, and the change broke support for APQs.
  • Our E2E test suite should have caught this bug.
  • However, recent improvements to our E2E test suite had silently invalidated the APQ E2E tests. The tests still ran and reported success, but they were mistakenly being run against a server that does not support APQs.
  • The impact of this bug was not widespread enough to trigger alarms after release.
  • At 11.45 pm UTC the same day, a customer reported an issue with APQs, and our engineering team started investigating.
  • On Jan 9th at 2.05 am UTC, a fix was deployed, and the issue was resolved.
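
The test-suite failure described above (APQ tests passing against a server without APQ support) is the kind of problem a precondition check can surface. The following is a minimal sketch, not Stellate's actual test code: it assumes the target follows the widely used Apollo-style APQ convention, where a hash-only request to a server that supports APQs but has not seen the hash yields a "PersistedQueryNotFound" error, and a server without APQ support yields "PersistedQueryNotSupported". The function names are hypothetical, and the error strings should be checked against the real server under test.

```python
import hashlib


def apq_hash_only_payload(query: str) -> dict:
    """Build an APQ request body that carries only the query's SHA-256 hash."""
    digest = hashlib.sha256(query.encode("utf-8")).hexdigest()
    return {"extensions": {"persistedQuery": {"version": 1, "sha256Hash": digest}}}


def classify_apq_response(body: dict) -> str:
    """Classify the response to a hash-only request.

    Returns "supported", "unsupported", or "unknown". The error strings are
    assumptions based on the Apollo-style APQ convention.
    """
    errors = body.get("errors") or []
    messages = {e.get("message") for e in errors}
    codes = {(e.get("extensions") or {}).get("code") for e in errors}

    if "PersistedQueryNotSupported" in messages or "PERSISTED_QUERY_NOT_SUPPORTED" in codes:
        return "unsupported"
    if "PersistedQueryNotFound" in messages or "PERSISTED_QUERY_NOT_FOUND" in codes:
        # The server understood the APQ extension but has not stored this hash yet.
        return "supported"
    if body.get("data") is not None and not errors:
        # The hash was already registered and the query executed.
        return "supported"
    return "unknown"


def assert_target_supports_apq(body: dict) -> None:
    """Fail loudly when APQ tests are pointed at a server without APQ support."""
    verdict = classify_apq_response(body)
    if verdict != "supported":
        raise AssertionError(f"APQ E2E precondition failed: target is {verdict!r}")
```

Running a guard like `assert_target_supports_apq` on the response to a hash-only request, before the APQ tests themselves, would turn a misconfigured target into a visible failure rather than a false success.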

Improvements

  • We’ve fixed the bug in our E2E test suite for APQ.
  • We’ve agreed on a path forward to start monitoring GraphQL errors. This work is in progress and being tracked, but it is not yet complete.
  • We’ve scheduled a rollback dry run as part of our next incident dry run, both to improve our institutional knowledge of rollback procedures and to find potential improvements.
Posted Jan 12, 2024 - 13:39 UTC

Resolved
Posted Jan 08, 2024 - 13:45 UTC