Reproducing Amazon’s DynamoDB Service Disruption

On October 19th and 20th, 2025, many web services went down, including communication applications (Signal, Zoom, etc), collaborative tools (Asana, Slack, etc), video games (Fortnite, Roblox, etc), and so on. This outage also had consequences in the real world: several thousands of people could sleep due to erratic behaviors of their connected beds. The problem came from a race condition in the monitoring system of Amazon 53, the AWS domain name service. Shortly after the issue has been solved, Amazon Web Services published a detailed note on the incident¹.

Objective

The goal of this homework if to write a PlusCal model of the system described in the “DynamoDB” section of the note¹, and to reproduce the race condition that lead to the outage using model-checking. First, read the section “DynamoDB” from the note¹. Then, answer the questions in the next section. It will be useful to read the note several times while answering the questions below in order to get a good understanding of the system and the race condition that produced the incident.

Analysis of the incident

The system involves:

one Planner process,
several Enactor processes,
DNS plans, and
end points

The following questions are meant to guide the analysis of the note¹ and the production of the PlusCal model. All your choices must be justified.

How do the Planner and the Enactors communicate with each other?
What is relevant about the DNS plans in order to model the race condition?
1. How is modeled a plan?
2. How do you model the outdated plans?
3. How do you model the production of a new plan by the Planner?
What is the role of end points in the system regarding the issue to model?
Write a model of the Planner process
The note identifies three phases in the behavior of an Enactor: detecting if a new plan is available, enacting the new plan on each end point, and cleaning the outdated plans.
1. How do you propose to detect if a new plan is available?
2. How do you intend to model pushing the new plan to each end point?
3. Enacting the new plan to an endpoint may fail due to another Enactor updating the same end point. How to detect such a situation? How do you intend to handle it?
4. How do you know that all endpoints have been successfully updated?
5. How do you plan to model cleaning the outdated plans?
How do you formalise the property to check?
Use the model-checker to get a counter-example to the property, showing that your model is not correct. Compare the counter-example to the scenario in the note¹. Justify that your model allows to faithfully reproduce the issue described in the note¹.
How do you propose to fix this issue? Argue how your solution can be implemented in your model, and how it can fix the problem.

Expected outcome

You are expected to produce a file amazon_dynamoDB.tla containing your PlusCal model, a file amazon_dynamoDB.cfg used to model-check your model, and a file amazon_dynamoDB_answers.md containing your answers to the questions above, and written in the markdown file format.

Bibliography

Summary of the Amazon DynamoDB Service Disruption in the Northern Virginia (US-EAST-1) Region, AWS, https://aws.amazon.com/message/101925/ ↩ ↩² ↩³ ↩⁴ ↩⁵ ↩⁶