Sui Mainnet Outage Resolution
Sui engineers quickly diagnosed the problem and issued a fix, which validators then deployed, minimizing the network outage.
On the morning of November 21, 2024, between approximately 1:15 and 3:45 am PT, Sui Mainnet suffered a complete network halt. All validators were stuck in a crash loop, preventing all transaction processing.
What happened?
An assert! in congestion control code (described below) erroneously caused validators to crash if the estimated execution cost was zero. All of the following conditions had to be met for this edge case to trigger:
- Congestion control configured to use TotalGasBudgetWithCap mode.
  - This was briefly enabled in protocol version 63 before being reverted, and then enabled again with the accumulating scheduler in protocol version 68.
- The network receiving a transaction with both:
  - a mutable shared object input
  - zero MoveCall commands
As soon as the network received a transaction with this shape, all validators crashed.
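To make the failure mode concrete, here is a minimal sketch in Rust. The names (estimated_cost, schedule_shared_object_tx, PER_COMMAND_CAP) and numbers are purely illustrative, not the actual validator code; it assumes a capped estimate whose cap scales with the number of MoveCall commands, so a transaction with zero MoveCall commands produces an estimated cost of zero and trips the assertion.

```rust
// Illustrative sketch only: the function and constant names below are
// hypothetical, not the actual Sui validator code.

// Hypothetical per-MoveCall-command cap used by the capped estimate.
const PER_COMMAND_CAP: u64 = 1_000;

/// Capped estimate: the cap scales with the number of MoveCall commands,
/// so a transaction with zero MoveCall commands yields an estimate of zero.
fn estimated_cost(gas_budget: u64, move_call_count: u64) -> u64 {
    let cap = move_call_count * PER_COMMAND_CAP;
    gas_budget.min(cap)
}

/// Scheduling path for a transaction with a mutable shared object input.
fn schedule_shared_object_tx(gas_budget: u64, move_call_count: u64) {
    let cost = estimated_cost(gas_budget, move_call_count);
    // The problematic invariant: transactions touching shared objects were
    // assumed to always carry a positive estimated execution cost.
    assert!(cost > 0, "estimated execution cost must be non-zero");
    println!("scheduled with estimated cost {cost}");
}

fn main() {
    // Typical transaction: cost = min(50_000, 2 * 1_000) = 2_000.
    schedule_shared_object_tx(50_000, 2);
    // Zero MoveCall commands: cost = min(50_000, 0) = 0, so the assert panics,
    // which is the crash every validator hit.
    schedule_shared_object_tx(50_000, 0);
}
```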
What is congestion control?
The Sui network’s object-based architecture allows us to process many different user transactions massively in parallel, in a way that’s not possible on most other networks. However, if multiple transactions are all writing to the same shared object, they must execute in sequential order, and there is a limit to how many transactions we can process that touch that specific object.
Congestion control, the system that limits the rate of transactions writing to a single shared object, prevents the network from becoming overloaded with checkpoints that take too long to execute.
We recently upgraded our congestion control system to improve shared object utilization by more accurately estimating the complexity of a transaction. The code for the new mode, TotalGasBudgetWithCap, had a bug that caused this issue.
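As a rough illustration of the idea (not the actual Sui scheduler; CongestionTracker, try_schedule, and the budget value are hypothetical), per-object congestion control can be sketched as an accumulator that defers transactions once a shared object's per-commit execution budget is used up:

```rust
// Illustrative sketch only: CongestionTracker, try_schedule, and the budget
// value are hypothetical and much simpler than the real Sui scheduler.
use std::collections::HashMap;

// Hypothetical per-commit execution budget for a single shared object.
const PER_OBJECT_BUDGET: u64 = 10_000;

#[derive(Default)]
struct CongestionTracker {
    // Accumulated estimated cost per shared object in the current commit.
    accumulated: HashMap<String, u64>,
}

impl CongestionTracker {
    /// Returns true if the transaction fits in this commit, false if it must
    /// be deferred because the shared object's budget is already used up.
    fn try_schedule(&mut self, object_id: &str, estimated_cost: u64) -> bool {
        let used = self.accumulated.entry(object_id.to_string()).or_insert(0);
        if *used + estimated_cost > PER_OBJECT_BUDGET {
            return false; // defer: this shared object is saturated
        }
        *used += estimated_cost;
        true
    }
}

fn main() {
    let mut tracker = CongestionTracker::default();
    for (tx, cost) in [("tx1", 6_000u64), ("tx2", 3_000), ("tx3", 5_000)] {
        let scheduled = tracker.try_schedule("shared_counter", cost);
        println!("{tx}: {}", if scheduled { "scheduled" } else { "deferred" });
    }
}
```

In this sketch a deferred transaction would simply be retried in a later commit; the real system is more involved, but the core idea is the same: bound how much work any single shared object can contribute to a checkpoint.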
How did we fix it?
Once we identified the issue, the fix was straightforward (PR #20365). It was shipped to Mainnet in v1.37.4 and Testnet in v1.38.1. Thanks to an overwhelming response from our validators, it took only 15 minutes from the time the fix was released until the Sui network was back up and running.
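For intuition only, this is the general shape such a fix can take; the actual change is in PR #20365, and the function below is hypothetical. Rather than asserting that the estimate is non-zero, a zero estimate is tolerated, here by clamping it to a minimum of one unit.

```rust
// Hypothetical illustration of the shape of the fix (see PR #20365 for the
// real change): tolerate a zero estimate instead of asserting on it, here by
// clamping the capped estimate to a minimum of one unit.
fn estimated_cost(gas_budget: u64, move_call_count: u64) -> u64 {
    let cap = move_call_count * 1_000; // illustrative per-command cap
    gas_budget.min(cap).max(1)
}

fn main() {
    // A transaction with zero MoveCall commands no longer yields a zero
    // estimate that violates downstream invariants.
    assert_eq!(estimated_cost(50_000, 0), 1);
    assert_eq!(estimated_cost(50_000, 2), 2_000);
}
```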
What have we learned?
- Our incident detection and response systems are working well. Automated alerts fired to oncall engineers at the same time we heard reports of issues from the community, and we quickly had all hands on deck to diagnose and fix the issue. Thanks to swift work from the incredible community of Sui validators, the network was back online almost immediately after the fix became available.
- In an effort to prevent future bugs of this nature, we’ll be working on shoring up our testing systems to generate a wider variety of adversarial transactions like the one that triggered the crash.
- In an effort to reduce incident response time even further, we’ll be working on improvements to our build workflows to make debug and release binaries available more quickly. A significant portion of the outage time was spent waiting for the release to build.