Sui Mainnet Network Stall Resolution
Thanks to rapid action by Sui engineers and the validator community, the issue behind yesterday’s mainnet stall was quickly identified, and normal network activity has been restored.
Summary
On January 14, 2026, Sui Mainnet experienced a prolonged disruption caused by an internal divergence in validator consensus processing. As a result, validators were unable to certify new checkpoints, and transaction submissions timed out.
The issue was unrelated to network congestion or transaction volume. Remote Procedure Call (RPC) reads continued to serve the last certified state throughout the incident, except on nodes explicitly configured to stop serving data once it has not been updated for some time.
The interruption was unintended, but the network’s safety-focused architecture behaved as designed. Importantly:
- No certified state forks occurred and no certified transactions were rolled back.
- User funds were never at risk.
- The disruption was not related to network congestion issues or any outside threats.
- No safety or consistency guarantees were violated.
The issue was detected and contained by Sui’s checkpoint certification and quarantine mechanisms, which prevented any user-visible fork at the cost of halting progress. The network resumed normal operation after validators deployed a fix and reprocessed consensus data.
User Impact
During the incident window, the network halted transaction processing to preserve safety, resulting in the following impacts:
- Transaction submissions: Timed out while validators were halted
- Transaction execution: No transactions were executed during the incident window
- Reads: Continued serving the last certified state consistently
- Funds & state: Safe; no rollbacks or double spends
Total user-visible disruption was approximately 6 hours.
What Happened
At a high level, the incident was caused by an edge-case bug in consensus commit logic for handling conflicting transactions under certain garbage-collection conditions: an optimization path caused different validators to reach different conclusions when computing consensus commits. The timing of the incident was coincidental, unrelated to increased usage or network load.
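To make that failure mode more concrete, here is a minimal, hypothetical sketch (the names, types, and garbage-collection mechanics below are illustrative assumptions, not the actual Sui consensus code) of how a commit-building path that consults locally tracked garbage-collection state can produce different commits on different validators from the same ordered input:

```rust
// Hypothetical illustration only, not the Sui consensus implementation.
// Two validators process the same ordered transactions, but an optimization
// path consults a locally tracked garbage-collection round. If that local
// state differs, the computed commits diverge.

struct Tx {
    id: u64,
    round: u64, // consensus round the transaction belongs to
}

/// Builds a commit from ordered transactions, skipping anything at or
/// below the validator's local garbage-collection round.
fn compute_commit(ordered: &[Tx], local_gc_round: u64) -> Vec<u64> {
    ordered
        .iter()
        .filter(|tx| tx.round > local_gc_round) // the divergent "optimization" check
        .map(|tx| tx.id)
        .collect()
}

fn main() {
    let ordered = vec![
        Tx { id: 1, round: 10 },
        Tx { id: 2, round: 11 }, // a conflicting transaction near the GC boundary
        Tx { id: 3, round: 12 },
    ];

    // Validator A has garbage-collected up to round 10, validator B up to round 11.
    let commit_a = compute_commit(&ordered, 10);
    let commit_b = compute_commit(&ordered, 11);

    // Same ordered input, different commits -> different downstream checkpoints.
    assert_ne!(commit_a, commit_b);
    println!("A committed {:?}, B committed {:?}", commit_a, commit_b);
}
```

Once commits differ, the checkpoints deterministically derived from them differ too, which is exactly what the certification step later detects.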
The Sui core stack has multiple layers of transaction processing:
- Consensus produces an ordered stream of commits.
- Deterministic execution turns those commits into checkpoints.
- Quarantine ensures that no effects become finalized until certification succeeds. Think of it as a holding area for executed transaction effects that are not yet settled.
- Checkpoint certification requires a quorum of stake to sign the same checkpoint digest (see the sketch after this list).
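For intuition about that last step, here is a minimal sketch of the certification rule (the types, function names, and stake figures are hypothetical, not the production Sui API): signatures are tallied per checkpoint digest, and a checkpoint is certified only once a single digest accumulates more than two-thirds of total stake.

```rust
use std::collections::HashMap;

// Illustrative sketch of the certification rule; types and names are
// hypothetical, not the production Sui API.

type Digest = [u8; 32];

/// Returns the certified digest, if any single digest has gathered
/// strictly more than 2/3 of the total stake.
fn try_certify(signatures: &[(Digest, u64)], total_stake: u64) -> Option<Digest> {
    let mut stake_by_digest: HashMap<Digest, u64> = HashMap::new();
    for (digest, stake) in signatures {
        *stake_by_digest.entry(*digest).or_default() += *stake;
    }
    stake_by_digest
        .into_iter()
        .find(|(_, stake)| *stake * 3 > total_stake * 2)
        .map(|(digest, _)| digest)
}

fn main() {
    let total_stake = 10_000;
    let digest = [0xAA; 32];
    // 75% of stake signs the same digest -> the checkpoint is certified
    // and its effects can leave quarantine.
    let signatures = vec![(digest, 7_500), ([0xBB; 32], 2_500)];
    assert_eq!(try_certify(&signatures, total_stake), Some(digest));
    println!("checkpoint certified");
}
```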
During this incident:
- A consensus bug caused different validators to derive different consensus commit outputs.
- This led them to execute different candidate checkpoints.
- Validators exchanged checkpoint signatures and observed that more than 1/3 of stake was signing a different digest. Because certification requires more than 2/3 of stake on a single digest, no digest could reach quorum, making certification impossible (see the arithmetic sketch below).
- As a result, validators stalled to avoid finalizing inconsistent state.
This behavior prevented any user-visible inconsistency, but resulted in a network halt.
While disruptive, this is the intended failure mode for this class of issue: halt safely rather than risk inconsistent finalized state.
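To see why a halt was unavoidable once the split appeared, consider illustrative numbers (the actual stake distribution during the incident is not given here): certification needs more than two-thirds of stake on one digest, so once more than one-third of stake has signed a conflicting digest, no digest can ever reach quorum. A minimal sketch:

```rust
// Illustrative arithmetic only; the actual stake split is not published here.

/// Certification needs one digest to gather more than 2/3 of total stake.
/// If `conflicting_stake` is already committed to a different digest, the
/// best any digest can do is `total_stake - conflicting_stake`.
fn certification_still_possible(conflicting_stake: u64, total_stake: u64) -> bool {
    (total_stake - conflicting_stake) * 3 > total_stake * 2
}

fn main() {
    let total_stake = 10_000;

    // Less than 1/3 of stake on a conflicting digest: quorum is still reachable.
    assert!(certification_still_possible(3_000, total_stake));

    // More than 1/3 of stake (here 4,000) on a conflicting digest: no digest
    // can ever reach the > 2/3 quorum, so validators halt rather than risk
    // finalizing divergent state.
    assert!(!certification_still_possible(4_000, total_stake));
    println!("certification impossible: safe halt");
}
```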
Recovery
Recovery proceeded in several stages:
- Diagnosis and fix: The team identified the divergence and implemented a fix to purge incorrect consensus data from the point of divergence and enable the corrected logic.
- Canary deployment: Mysten Labs validators canaried the fix and verified correct behavior via logs and checkpoint production.
- Validator rollout: Validators immediately elected to upgrade to the fixed binary, replayed consensus safely, and resumed checkpoint signing.
Resumption
Once a quorum of validators signed the same checkpoint digest, checkpoint certification and state sync resumed. The network returned to normal operation.
What We’re Improving
While this interruption was unintended, it confirmed that Sui's safety-focused architecture did what it was supposed to do: protect the network. Our goal going forward is to continue reducing recovery time when rare issues occur. Based on what we learned, we are making the following improvements:
- Faster detection and recovery: We are improving our ability to pause consensus earlier once checkpoint inconsistencies are detected. Doing so will reduce the amount of data that needs to be replayed during recovery and significantly shorten time to restoration in similar scenarios.
- Operator tooling: Recovering from this incident required careful manual reasoning about divergent internal state. The team is building safer, more automated operator tooling to identify and clean up inconsistent internal state in a controlled way, reducing manual effort and speeding up future recovery.
- Testing and validation: We are expanding consensus-specific randomized testing to reliably reproduce this class of issue and validate fixes before deployment. Antithesis test configurations have already been updated to surface this scenario consistently.
Throughout the incident, Sui’s safety guarantees were preserved, and normal network operation has fully resumed.