Sui Testnet Wave 1 came to an end on December 1, 2022 after a multi-week run. We achieved our goal of practicing decentralized coordination and incident response on a geo-distributed Sui network with independent validators and node operators across eight time zones and ten countries. We would like to express our utmost appreciation to the Sui Validators, operators, and users who made Wave 1 a success!
Here is what we worked on during Wave 1:
Genesis: Bootstrapping a decentralized network involves careful orchestration. In Wave 1, we successfully conducted a collaborative genesis ceremony with our validators to bring Sui online.
Monitoring: Observability is necessary for maintaining the health of any multi-node network, and doubly so when disparate, geo-distributed operators own the nodes. In Wave 1, we set up global monitoring to observe consensus health, networking health, throughput, and resource usage.
Communication: Operators need a communication channel for coordinating the genesis ceremony, asking about changing metrics, sharing issues, and learning about software patches. In addition, we need a way to coordinate updates and restarts. During Wave 1, we tried a single Discord channel for communication and found that this simple approach met our needs fairly well.
Mitigation: Maintaining network health requires responding to events in a timely manner. Wave 1 let us practice processes to detect, diagnose, and mitigate network events involving issues such as disconnections, misconfigurations, documentation errors, node sync and catch-up, consensus reliability, machine resource consumption, and transaction traffic surges.
Updates: Maintaining a healthy network can require applying live updates and patches. Throughout Wave 1, we rolled out three software updates to mitigate issues as they arose. All operators were able to update to the new version with no downtime or data loss.
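To make the link between monitoring and live updates concrete, here is a minimal sketch of the kind of check an operator dashboard might run before a rolling restart. It assumes equally weighted validators and the standard Byzantine fault tolerance rule that a quorum is strictly more than two-thirds of the validators. The function names are hypothetical and are not part of Sui's tooling; a production network would typically weight votes by stake rather than count nodes.

```python
def quorum_size(total_validators: int) -> int:
    """Smallest validator count that is strictly more than two-thirds of the total.

    Assumption: every validator has equal voting power.
    """
    if total_validators < 1:
        raise ValueError("need at least one validator")
    return (2 * total_validators) // 3 + 1


def max_offline(total_validators: int) -> int:
    """How many validators can be down at once while a quorum is still reachable."""
    return total_validators - quorum_size(total_validators)


def has_quorum(total_validators: int, live_validators: int) -> bool:
    """True if the validators currently live are enough to form a quorum."""
    return live_validators >= quorum_size(total_validators)
```

Under these assumptions, a network of 10 validators needs 7 of them live, so at most 3 could be restarted at the same time without losing quorum.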
Total transactions processed: ~22 million
Total on-chain NFTs: ~11 million*
Total packages published: ~2,600*
Total coins dispensed by faucet: 251 billion MIST
Total requests served by Testnet faucet globally: 4.19 million
Wave 1 Incidents and Fixes
A testnet with no operational incidents would be a missed opportunity for learning to debug and mitigate issues in a live environment. Testnet Wave 1 presented its share of challenges for our operators, but fortunately we were able to understand these issues, fix them, and improve Sui (in many cases via improvements rolled out during a Wave 1 update).
Here are three memorable incidents:
- We addressed a consensus stall in which validators gradually lost consensus liveness until the network could no longer achieve quorum. Our multi-day debugging revealed an edge case in the Narwhal Byzantine broadcast where a node could wait for the response to a request that had been deduped and never sent out, leading to a livelock. Our team rolled out a fix for this edge case and gradually restored all the stalled validators to regain quorum. (This incident took place several days prior to opening up Testnet to the public on November 17, and we subsequently opted to launch a new network for other technical reasons.)
- We addressed a scenario where newly restarted validators failed to rejoin and catch up to the latest state of consensus. Our team identified an edge case where the Narwhal consensus round number was incorrectly set to zero upon a restart rather than to the correct round, leading to excessively slow requests that resulted in timeouts. We were able to patch this case, update lagging validators, and help them catch up while the network was live and operational.
- Last but not least, thanks to Wave 1 activities we were able to determine the root cause of a long-standing networking issue that contributed significantly to memory leakage. While we did not get a chance to apply and test this fix during Wave 1, the fix is now available in our upstream repository and will soon make its way to Devnet and Wave 2.
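The first incident belongs to a general class of bug that is easy to reproduce in miniature. The sketch below is a hypothetical illustration, not Narwhal's actual code: the `Deduper` class, its `cleanup_on_failure` flag, and the `certificate-42` key are all invented for the example. It assumes the deduplication table records a request as pending before the send is attempted, so a failed send can leave behind an entry that causes later callers to wait on a request that never went out.

```python
class Deduper:
    """Hypothetical request deduplicator: sends each key at most once while pending."""

    def __init__(self, send, cleanup_on_failure: bool):
        self._send = send                      # callable that puts a request on the wire
        self._cleanup_on_failure = cleanup_on_failure
        self._pending = set()                  # keys believed to have a request outstanding

    def request(self, key: str) -> str:
        if key in self._pending:
            # Caller will wait for the response to the earlier request.
            return "deduped"
        self._pending.add(key)
        try:
            self._send(key)
        except ConnectionError:
            if self._cleanup_on_failure:
                # Fixed behavior: forget the key so a later caller retries the send.
                self._pending.discard(key)
            return "failed"
        return "sent"

    def on_response(self, key: str) -> None:
        self._pending.discard(key)


def make_flaky_send():
    """Return a send function whose first call fails and whose later calls succeed."""
    calls = {"count": 0}

    def send(key: str) -> None:
        calls["count"] += 1
        if calls["count"] == 1:
            raise ConnectionError("peer unreachable")

    return send


# Buggy variant: the failed send leaves a stale entry, so the retry is deduped
# against a request that was never sent and would wait forever.
buggy = Deduper(make_flaky_send(), cleanup_on_failure=False)
buggy_results = [buggy.request("certificate-42"), buggy.request("certificate-42")]

# Fixed variant: the stale entry is removed, so the retry actually goes out.
fixed = Deduper(make_flaky_send(), cleanup_on_failure=True)
fixed_results = [fixed.request("certificate-42"), fixed.request("certificate-42")]

print(buggy_results)
print(fixed_results)
```

In the buggy variant the second call is reported as deduped even though nothing is in flight, which is the shape of the stall described above. Cleaning up the entry on failure is only one possible remedy; the actual fix applied during Wave 1 is not detailed here.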
Mysten Labs' engineers worked tirelessly throughout Wave 1, and we will continue bringing our best efforts to test the Sui network during subsequent waves to ensure stable operation at Mainnet.
Next Step: Wave 2
With Testnet Wave 1, the Sui community took its first step on the journey of building a healthy and vibrant network. This effort paved the way for Wave 2, which will focus on epoch management, tokenomics, and stake delegation. We expect to launch this next Testnet Wave in early 2023.
The Sui community of validators and node operators made Testnet Wave 1 a tremendous success. We look forward to growing and refining the Sui infrastructure together with our community in another highly collaborative wave!