Sui Mainnet Halts Resolved After Major Upgrade

Summary

On Thursday, May 28 and Friday, May 29, 2026, Sui Mainnet experienced three outage incidents. The first two stemmed from crash bugs involving the interaction of gas charging logic and the recent 1.72 release (which introduced address balances). The fix for the Thursday issue was an interim measure designed to restore functionality to the network while the Sui Core Team worked on a long-term solution. The interim fix had a known issue with a low probability of causing a halt. The team accepted the risk accompanying this proposal in order to bring the halted network back as quickly as possible while a robust fix was developed. On Friday morning, the network hit a variant of the known issue and halted.

The third outage occurred Friday afternoon at the next scheduled epoch change, when a separate latent bug — in how validators preserve randomness state across restarts — was exposed as validators restarted to adopt the Friday morning fix.

The first outage began at ~7am PT and ended at ~1:30pm PT on Thursday, the second began at ~5am PT and ended at ~9:40am PT on Friday, and the third began at ~1:30pm PT and ended at ~7:20pm PT on Friday. During the outages, no user funds were at risk, and the network did not revert any committed transactions when it resumed.

As of now, validators have fully addressed the known issues caused by both the original gas-charging bug and the randomness-state bug, and network activity has resumed.

What happened part 1: gas smashing on transaction failing with InsufficientFundsForWithdraw leads to underflow

Sui release 1.72 introduced address balances, which give Sui users a new way to store funds and pay for gas without using coin objects. Sui transactions can pay for gas using:

  •  an address balance, expressed as an empty list of input coins to the transaction 
  •  coin objects, expressed as a list of input coins to the transaction
  •  a mix of both (which we’ll call hybrid gas), expressed as a list of input coins and synthetic coin reservation objects that spend from an address balance.

In the latter two cases, the runtime performs so-called gas smashing before charging the transaction for gas: it combines all input coins into a single coin that is debited for gas. Importantly, gas smashing happens both for transactions that execute successfully and for transactions that are cancelled.
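
As a rough illustration of gas smashing, the sketch below merges several gas inputs into one and then debits the gas charge from it. The `Coin`, `smash_gas`, and `charge_gas` names are hypothetical and are not Sui's runtime types; the real logic operates on coin objects and reservation objects with far more bookkeeping.

```python
from dataclasses import dataclass


@dataclass
class Coin:
    """Hypothetical stand-in for a gas input (coin object or reservation)."""
    coin_id: str
    value: int


def smash_gas(coins: list[Coin]) -> Coin:
    """Merge all gas inputs into the first one, keeping its identifier."""
    if not coins:
        raise ValueError("gas smashing needs at least one input coin")
    return Coin(coins[0].coin_id, sum(c.value for c in coins))


def charge_gas(coin: Coin, gas_used: int) -> Coin:
    """Debit the gas charge from the smashed coin."""
    if gas_used > coin.value:
        raise ValueError("insufficient gas")
    return Coin(coin.coin_id, coin.value - gas_used)


smashed = smash_gas([Coin("a", 700), Coin("b", 300)])
after = charge_gas(smashed, 250)
```

The point to notice is that smashing happens before the charge, so every input is drawn down even if the transaction is later cancelled.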

The root cause of the outage issues involved an edge case in gas smashing during hybrid gas payments. If a reservation attempts to overdraft an address balance while determining if the transaction has enough gas to meet its budget, the attempt is blocked and the transaction is marked as cancelled with InsufficientFundsForWithdraw. However, during subsequent gas smashing, the same reservation object attempts an overdraft again. Essentially, the transaction is cancelled because the address balance didn’t have enough funds, but then gas smashing spends those same funds. 

The above is an oversimplification in two ways:

  • The crash does not actually happen during gas smashing. To support concurrent withdrawals from and deposits to the same address, using an address balance in a transaction emits balance deltas that are reconciled by a system settlement transaction, which sums the deltas and determines the new value of each address balance. This is where the crash occurred, due to a negative delta from the gas smashing applied to a zero balance.
  • Encountering an InsufficientFundsForWithdraw is not as easy as trying to spend X from an address balance that holds <X (this would fail before consensus and not be included as a Sui transaction). It can only happen if two transactions that hit the scheduler at the same time are competing to spend the same funds from an address balance that cannot cover both. Cancelling transactions with this error is how the scheduler prevents overdrafts, but it cannot do this if the cancelled transaction still debits funds due to gas smashing.
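
The two points above can be sketched as a toy settlement step. This is a simplified model, not Sui's settlement transaction: `settle` and `BalanceUnderflow` are hypothetical names, and balances are plain integers keyed by an address string.

```python
class BalanceUnderflow(Exception):
    """Raised when settlement would push an address balance below zero."""


def settle(balances: dict[str, int], deltas: list[tuple[str, int]]) -> dict[str, int]:
    """Sum the deltas emitted for each address and apply them to the balances."""
    totals: dict[str, int] = {}
    for address, delta in deltas:
        totals[address] = totals.get(address, 0) + delta

    settled = dict(balances)
    for address, total in totals.items():
        new_balance = settled.get(address, 0) + total
        if new_balance < 0:
            # Balances cannot go negative in this model, so this is a hard failure.
            raise BalanceUnderflow(f"{address}: {new_balance}")
        settled[address] = new_balance
    return settled


balances = {"0xA": 100}

# Two transactions race for the same 100 units. The scheduler lets tx1 through
# and cancels tx2, so only tx1 should emit a withdrawal delta.
expected = settle(balances, [("0xA", -100)])

# Buggy case: the cancelled tx2 still emits -100 because gas smashing ran.
try:
    settle(balances, [("0xA", -100), ("0xA", -100)])
    crashed = False
except BalanceUnderflow:
    crashed = True
```

In this model the first withdrawal drains the balance to zero, and the stray delta from the cancelled transaction is what triggers the underflow.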

The fix is conceptually simple: don’t smash gas when a transaction is cancelled with InsufficientFundsForWithdraw. The core team proposed this fix, which brought the network back up following adoption by validators.
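
A minimal sketch of that guard, assuming a simplified model in which a hybrid-gas reservation contributes a single balance delta at settlement. `CancelReason`, `CONGESTION`, and `reservation_delta` are illustrative names, not identifiers from the Sui codebase.

```python
from enum import Enum, auto
from typing import Optional


class CancelReason(Enum):
    INSUFFICIENT_FUNDS_FOR_WITHDRAW = auto()
    CONGESTION = auto()  # hypothetical placeholder for any other reason


def reservation_delta(reserved: int, cancel_reason: Optional[CancelReason]) -> int:
    """Balance delta that a reservation contributes when gas is smashed."""
    if cancel_reason is CancelReason.INSUFFICIENT_FUNDS_FOR_WITHDRAW:
        # The scheduler already refused this withdrawal, so smashing is skipped
        # and the address balance is left untouched.
        return 0
    # Otherwise smashing draws the reserved amount out of the address balance.
    return -reserved


skipped = reservation_delta(100, CancelReason.INSUFFICIENT_FUNDS_FOR_WITHDRAW)
normal = reservation_delta(100, None)
```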

What happened part 2: InsufficientFundsForWithdraw masked by different error, same underflow hit

Changing gas logic is a delicate operation: 

  • As explained above, there are complicated interactions between address balances and coins.
  • Other than fixing bugs, gas logic changes must preserve all previous behavior or use appropriate version gating (otherwise, nodes may fork while replaying old transactions with the new logic).
  • Sui gas charging has conservation checks to ensure that no transaction can create or destroy SUI. This is a very valuable safety protection, but it also means that, for example, forgetting to credit any funds charged for gas to the appropriate place will cause a crash.
  • Charging expensive transactions for gas is a key form of DoS protection.
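
To make the conservation constraint concrete, here is a toy check under the assumption that a transaction's SUI can be reduced to three totals. `check_conservation` and `ConservationError` are hypothetical; the real checks account for more flows than this sketch does.

```python
class ConservationError(Exception):
    """Raised when a transaction would create or destroy funds."""


def check_conservation(total_in: int, total_out: int, gas_credited: int) -> None:
    """Everything that enters must leave, either as outputs or as credited gas."""
    if total_in != total_out + gas_credited:
        raise ConservationError(
            f"in={total_in} out={total_out} gas_credited={gas_credited}"
        )


# Balanced: 1000 in, 750 returned to the sender, 250 credited as gas.
check_conservation(total_in=1000, total_out=750, gas_credited=250)

# Unbalanced: gas was charged but never credited anywhere.
try:
    check_conservation(total_in=1000, total_out=750, gas_credited=0)
    violated = False
except ConservationError:
    violated = True
```

A check like this turns an accounting mistake into an immediate failure, which is the safety property described above and also why a careless gas change can crash a node.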

Any change to fix a bug must carefully weigh each of these factors. In addressing the Thursday crash, the core team decided to propose an interim fix that would bring the network back as soon as possible but had some known shortcomings. The fix was proposed at ~12pm PT, and enough validators adopted it to bring the network back by ~1:30pm PT. This bought time to craft a more robust fix that addresses all of the above.

One such shortcoming is that a transaction may have multiple reasons for cancellation, and one reason can override the others. For example, a transaction using address balances waiting to touch a hot shared object might be cancelled due to too many higher priority transactions accessing the same object, then also be cancelled due to InsufficientFundsForWithdraw when a different transaction spends from the same address balance. In this case, the InsufficientFundsForWithdraw error may be masked by the other error, which bypasses the fix described above and causes the node to crash with the same underflow.
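
The masking problem can be sketched as follows. The precedence rule here (the first recorded reason is the one surfaced) is an assumption made for illustration, as are all of the names. The source does not describe how the robust fix works, so `skip_smash_any_reason` only shows why inspecting the surfaced reason alone is insufficient.

```python
from enum import Enum, auto
from typing import Optional


class CancelReason(Enum):
    INSUFFICIENT_FUNDS_FOR_WITHDRAW = auto()
    SHARED_OBJECT_CONGESTION = auto()  # hypothetical name for the other error


def surfaced_reason(reasons: list[CancelReason]) -> Optional[CancelReason]:
    """Assumed precedence: the first recorded reason masks the rest."""
    return reasons[0] if reasons else None


def skip_smash_surfaced_only(reasons: list[CancelReason]) -> bool:
    """Interim-style check that only sees the surfaced reason."""
    return surfaced_reason(reasons) is CancelReason.INSUFFICIENT_FUNDS_FOR_WITHDRAW


def skip_smash_any_reason(reasons: list[CancelReason]) -> bool:
    """Check that considers every recorded reason."""
    return CancelReason.INSUFFICIENT_FUNDS_FOR_WITHDRAW in reasons


# Cancelled for congestion first, then for insufficient funds.
reasons = [
    CancelReason.SHARED_OBJECT_CONGESTION,
    CancelReason.INSUFFICIENT_FUNDS_FOR_WITHDRAW,
]
interim_skips = skip_smash_surfaced_only(reasons)
full_skips = skip_smash_any_reason(reasons)
```

With the surfaced-only check, smashing still runs for this transaction and the same underflow follows.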

The scenario above occurred on Friday morning, leading to a second outage with the same root cause. At the time of the outage, the team was already close to completing the robust fix. It was finished in time to be proposed to validators by ~8am PT, and enough validators adopted it to bring the network back by ~9:40am PT.

What happened part 3: epoch change stalled due to disabled DKG

The network ran normally from ~9:40am PT until ~1:30pm PT, when the scheduled epoch change failed to complete and the network halted a third time, due to a bug whose conditions were set by the previous restart cycle.

At the start of each epoch, Sui's validators run a distributed key generation (DKG) protocol that bootstraps the random beacon used by transactions that depend on on-chain randomness. DKG requires a higher participation threshold than normal consensus; if participation is not high enough, randomness is disabled for the rest of that epoch.

When validators restarted to adopt the Part 2 fix, participation wasn’t high enough for the next epoch's DKG, and randomness was disabled as designed. But due to a latent bug, that failure verdict was never written to disk, so as further restarts followed, each validator came back up unaware that DKG had failed. While DKG runs, randomness-dependent transactions wait to either execute or be cancelled. With validators no longer remembering that DKG had failed, neither could happen: the paused queue grew, and the end-of-epoch logic, which must drain that queue before closing, was left waiting on a DKG result that would never come.

The fix PR had two parts:

  • Fixing the latent bug so that DKG status is persisted across restarts.
  • Adding a mechanism that lets validators close a stuck epoch at a coordinated point. This was used once to close the affected epoch; the network then proceeded into the new epoch normally and randomness was restored.
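
The first part of the fix amounts to making the DKG verdict durable. The sketch below shows one generic way to do that with an atomic file write; `DkgStatusStore`, the JSON layout, and the file name are all assumptions for illustration and do not reflect how Sui validators store this state.

```python
import json
import os
import tempfile
from typing import Optional


class DkgStatusStore:
    """Hypothetical on-disk record of the DKG outcome for one epoch."""

    def __init__(self, path: str) -> None:
        self.path = path

    def record(self, epoch: int, status: str) -> None:
        """Write the verdict to a temp file, sync it, then atomically rename."""
        tmp_path = self.path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump({"epoch": epoch, "status": status}, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, self.path)

    def load(self, epoch: int) -> Optional[str]:
        """Return the stored verdict for this epoch, or None if there is none."""
        try:
            with open(self.path) as f:
                data = json.load(f)
        except FileNotFoundError:
            return None
        return data["status"] if data["epoch"] == epoch else None


path = os.path.join(tempfile.mkdtemp(), "dkg_status.json")
DkgStatusStore(path).record(epoch=42, status="failed")

# Simulate a restart: a fresh store instance reads the verdict back from disk.
restored = DkgStatusStore(path).load(epoch=42)
```

After the simulated restart, the node can see that DKG failed for the epoch and cancel the waiting randomness-dependent transactions instead of holding them indefinitely.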

What we learned

  • End-of-epoch resilience. Sui already has a "safe mode" fallback for some parts of the epoch transition. Part 3 showed that the current pattern might be too narrow. End-of-epoch resilience is an area the ecosystem needs to invest in further, both by extending graceful-degradation patterns across the rest of the reconfiguration path and by maturing the force-close mechanisms into a standing operational capability.
  • Gas charging logic. The crashes in Parts 1 and 2 both stemmed from bugs in gas charging, a corner of execution that interacts subtly with the address-balance settlement system, conservation checks, and the scheduler. Today, this logic is complex enough that edge cases like the ones we hit this week are hard to rule out by inspection alone. Coming out of this incident, we believe gas charging deserves the same care, thoughtfulness, and code-quality bar as the Move VM or Mysticeti consensus. Making it cleaner, more modular, and amenable to exhaustive invariant testing is an area worth deeper investment.
  • AI-assisted debugging. AI agents with access to production state, capable of interactively querying validator logs, inspecting cluster state, and assembling metrics on demand, materially accelerated diagnosis during this week's incidents. This tooling broadens the set of engineers who can effectively debug live production issues and is worthy of continued investment.
  • Better failure containment. The crashes in Parts 1 and 2 were each triggered by specific inputs the validators could not process safely. Today, the system lacks a defense-in-depth layer that would bound the blast radius of such a crash. Coming out of this incident, failure containment is an area worth deeper ecosystem investment, in order to explore strategies that would let a validator skip or restart any input that has caused it to crash, so that a future bug of this class would at worst drop the offending input rather than halt the network.
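
One way to picture the containment idea is a processing loop that quarantines any input whose handler fails instead of letting the failure stop everything. This is a conceptual sketch only: the names are hypothetical, a Python exception stands in for a validator crash, and a real design would also need every validator to skip the same inputs deterministically.

```python
from typing import Callable


def process_with_containment(
    inputs: list[tuple[str, int]],
    handler: Callable[[int], int],
    quarantine: set[str],
) -> dict[str, int]:
    """Run the handler on each input, dropping any input that makes it fail."""
    results: dict[str, int] = {}
    for input_id, payload in inputs:
        if input_id in quarantine:
            continue  # Known-bad input from an earlier attempt: skip it.
        try:
            results[input_id] = handler(payload)
        except Exception:
            # Contain the failure to this one input and keep going.
            quarantine.add(input_id)
    return results


def handler(payload: int) -> int:
    """Toy handler that cannot process negative payloads."""
    if payload < 0:
        raise ValueError("unprocessable input")
    return payload * 2


quarantine: set[str] = set()
results = process_with_containment([("t1", 1), ("t2", -1), ("t3", 3)], handler, quarantine)
```

Here the offending input is dropped while the inputs around it still complete, which is the worst-case outcome the paragraph above aims for.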