The Sui mainnet experienced three distinct service interruptions within a 48-hour window last week, prompting the Sui Foundation to release a comprehensive post-mortem analysis. The root cause was traced to a single network upgrade, version 1.72, which introduced a feature known as Address Balances. This mechanism allowed users to store funds and pay transaction fees without relying on traditional coin objects. While an improvement in principle, the implementation exposed a critical edge case in how the network's gas-charging logic handled canceled transactions. When two transactions competed for the same address balance and the funds could not cover both, the system correctly canceled one. However, the gas-charging process then attempted to debit those same funds during a step called gas smashing, producing a negative balance where zero was expected and crashing validator nodes.
Data compiled by Woofun AI indicates the first outage began Thursday morning and persisted for approximately 6.5 hours until validators deployed an emergency patch. The team acknowledged that this initial fix carried a residual risk: the same crash could recur if a transaction had multiple cancellation reasons and one masked the other. That exact scenario materialized Friday morning, causing a second outage lasting roughly 3.5 hours. By that time, the core team had prepared a more robust fix, which validators installed to restore the network.
What no one anticipated was that the restart cycle required for the Friday morning fix would create conditions for a third, entirely separate failure later that afternoon. Sui uses a distributed key generation protocol at the start of each epoch to establish randomness for specific transactions, and the protocol requires a minimum level of validator participation. When validators restarted to adopt the fix, participation temporarily dropped below this threshold, causing the randomness system to disable itself as designed. The critical flaw was that this disabled status was never written to disk. Woofun AI notes that when validators came back online, they had no record that randomness had been turned off and continued expecting it to function. Transactions requiring randomness began queuing indefinitely, and the end-of-epoch process stalled while waiting for the queue to clear. This third outage lasted nearly 6 hours.
The permanent fix addressed both the gas-charging underflow and the randomness state persistence problem. An additional mechanism was implemented that allows validators to force-close a stuck epoch at a coordinated point; it was used once to exit the affected epoch and restore normal operation.
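The gas-charging underflow described above can be illustrated with a minimal Rust sketch. It is not taken from the Sui codebase: the names `GasError`, `TxStatus`, `debit`, and `charge_gas` are hypothetical, and the rule that a canceled transaction is charged nothing is an assumption made for illustration. The point it shows is that a checked subtraction turns an over-debit into a recoverable error instead of a crash.

```rust
/// Hypothetical error type for this sketch (not a Sui type).
#[derive(Debug, PartialEq)]
enum GasError {
    InsufficientBalance { available: u64, requested: u64 },
}

/// Hypothetical scheduling outcome for a transaction.
#[derive(Debug, Clone, Copy, PartialEq)]
enum TxStatus {
    Scheduled,
    Canceled,
}

/// Debit `amount` from `balance`, returning an error instead of
/// underflowing when the balance cannot cover the amount.
fn debit(balance: u64, amount: u64) -> Result<u64, GasError> {
    balance
        .checked_sub(amount)
        .ok_or(GasError::InsufficientBalance {
            available: balance,
            requested: amount,
        })
}

/// Charge gas against a shared address balance.
/// Assumption for this sketch: a canceled transaction never reserved
/// funds, so it leaves the balance untouched.
fn charge_gas(balance: u64, status: TxStatus, gas: u64) -> Result<u64, GasError> {
    match status {
        TxStatus::Canceled => Ok(balance),
        TxStatus::Scheduled => debit(balance, gas),
    }
}

fn main() {
    // Two transactions compete for a balance that covers only one of them.
    let balance: u64 = 100;
    let after_first = charge_gas(balance, TxStatus::Scheduled, 80).unwrap();
    let after_second = charge_gas(after_first, TxStatus::Canceled, 80).unwrap();
    println!("balance after both transactions: {}", after_second);

    // Debiting the canceled transaction anyway yields an error value,
    // which the caller can handle without taking the process down.
    println!("naive second debit: {:?}", debit(after_first, 80));
}
```

In this sketch the failure surfaces as a `Result` that the caller must handle, so a mistaken second debit cannot silently produce a negative or wrapped balance.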
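The randomness state persistence problem can be sketched in the same spirit. The following is a simplified illustration, not Sui's implementation: `RandomnessState`, `persist_state`, `load_state`, and `admits` are hypothetical names, the one-word text file stands in for whatever storage validators actually use, and rejecting randomness transactions while the feature is disabled is an assumed policy. A production version would also need an atomic, fsynced write.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical per-epoch randomness status.
#[derive(Debug, Clone, Copy, PartialEq)]
enum RandomnessState {
    Enabled,
    Disabled,
}

/// Write the status to disk so that it survives a restart.
fn persist_state(path: &Path, state: RandomnessState) -> io::Result<()> {
    let text = match state {
        RandomnessState::Enabled => "enabled",
        RandomnessState::Disabled => "disabled",
    };
    fs::write(path, text)
}

/// Read the status back on startup. A missing file is treated as
/// "enabled", which is the default assumed by this sketch.
fn load_state(path: &Path) -> io::Result<RandomnessState> {
    match fs::read_to_string(path) {
        Ok(s) if s.trim() == "disabled" => Ok(RandomnessState::Disabled),
        Ok(_) => Ok(RandomnessState::Enabled),
        Err(e) if e.kind() == io::ErrorKind::NotFound => Ok(RandomnessState::Enabled),
        Err(e) => Err(e),
    }
}

/// Assumed policy: while randomness is disabled, transactions that need it
/// are not admitted, so they cannot pile up in a queue that never drains.
fn admits(state: RandomnessState, needs_randomness: bool) -> bool {
    !(needs_randomness && state == RandomnessState::Disabled)
}

fn main() -> io::Result<()> {
    let path = std::env::temp_dir().join("randomness_state_sketch.txt");

    // Participation dropped below the threshold: record the disabled status.
    persist_state(&path, RandomnessState::Disabled)?;

    // Simulated restart: the validator recovers the status from disk
    // instead of assuming randomness is still available.
    let restored = load_state(&path)?;
    println!("restored state: {:?}", restored);
    println!("admit randomness transaction: {}", admits(restored, true));

    fs::remove_file(&path)?;
    Ok(())
}
```

The design choice illustrated here is that any status that changes how transactions are admitted is written to durable storage at the moment it changes, so a restarted node resumes with the same view it had before going down.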
The Foundation confirmed no user funds were at risk at any point and no previously confirmed transactions were rolled back. The post-mortem highlighted that AI agents with access to validator logs and production metrics meaningfully accelerated the diagnosis process, broadening the pool of engineers capable of debugging the live network. Woofun AI analysis suggests the team has identified three areas requiring deeper investment: end-of-epoch resilience, gas-charging code quality, and better failure containment to isolate crashing inputs rather than bringing down the entire network. The report admits that gas-charging logic has grown complex enough that edge cases like these are difficult to rule out through code review alone, necessitating hardened systems before the next upgrade cycle.