Block Generation Suspension on Apr. 14-15
Description
On April 14 at 11:20 JST, block generation on Palette Chain stopped.
The nodes ran out of storage because contract execution and transaction volume on Palette Chain had increased.
To resolve the problem, the consortium members expanded the storage capacity of each node and restored block generation on April 15 at 13:30 JST.
Timeline of Events
Apr. 14:
11:00 JST: A disk error was detected on some validator nodes.
11:03 JST: Some nodes began to go down.
11:20 JST: More than 2/3 of the nodes were down. Block generation stopped.
12:00 JST: Expansion of storage capacity was completed for 9 of the 17 nodes, and these were restored.
Apr. 15:
11:34 JST: The number of nodes necessary to start block generation was restored.
12:30 JST: All nodes were stopped so that they could be restarted simultaneously to reset the round progression.
13:00 JST: All nodes restarted.
13:30 JST: Block generation restarted.
Cause of the Incident
The direct cause was a rapid increase in storage usage, driven by an unprecedented pace of contract execution and transaction volume from applications on the chain since around March 25.
In addition, the backup process temporarily consumed extra storage while it ran. At those moments the nodes ran out of disk space and the Geth process halted, leaving the nodes unstable.
As a result, several nodes went down, fewer than the 2/3 of validators required for transaction approval remained available, and block generation stopped.
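The 2/3 requirement can be illustrated with a short calculation. This is a minimal sketch, not the chain's actual consensus code: it assumes that all 17 nodes mentioned in the timeline are validators and that the quorum is the smallest whole number of validators that is at least 2/3 of the total. The function names are hypothetical, and the exact quorum rule depends on the consensus implementation in use.

```python
def quorum_size(total_validators: int) -> int:
    """Smallest whole number of validators that is at least 2/3 of the total."""
    if total_validators <= 0:
        raise ValueError("total_validators must be positive")
    # Integer form of ceil(2 * n / 3); avoids floating-point rounding.
    return (2 * total_validators + 2) // 3


def can_generate_blocks(available: int, total_validators: int) -> bool:
    """True when enough validators are online to reach the assumed quorum."""
    return available >= quorum_size(total_validators)


# Assumption: all 17 nodes in the report are validators.
TOTAL_VALIDATORS = 17

if __name__ == "__main__":
    print(quorum_size(TOTAL_VALIDATORS))             # 12
    print(can_generate_blocks(9, TOTAL_VALIDATORS))  # False
```

Under these assumptions, the 9 nodes restored at 12:00 JST on April 14 were still below the 12 needed, which is consistent with block generation remaining stopped at that point.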
Actions Taken
Expanded the storage space to cope with the increased usage.
Reviewed the backup acquisition process.
Redefined the thresholds for storage space usage and built alerts so that action can be taken before a disk becomes full.
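A threshold-based storage check of the kind described above might look like the following. This is an illustrative sketch only: the 70% and 85% thresholds, the function names, and the checked path are assumptions rather than the values or tooling the consortium actually uses. The warning level is deliberately set well below full so that temporary spikes, such as those from a backup run, still leave headroom.

```python
import shutil

# Hypothetical thresholds; the report does not state the real values.
WARN_RATIO = 0.70
CRITICAL_RATIO = 0.85


def classify_usage(used: int, total: int,
                   warn: float = WARN_RATIO,
                   critical: float = CRITICAL_RATIO) -> str:
    """Map a used/total byte count to 'ok', 'warning', or 'critical'."""
    if total <= 0:
        raise ValueError("total must be positive")
    ratio = used / total
    if ratio >= critical:
        return "critical"
    if ratio >= warn:
        return "warning"
    return "ok"


def check_path(path: str) -> str:
    """Classify the disk that holds `path` using the standard library."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return classify_usage(usage.used, usage.total)


if __name__ == "__main__":
    # Assumption: "." stands in for the node's data directory.
    print(check_path("."))
```

In practice the result would be fed into whatever alerting system the operators already run, so that a "warning" prompts storage expansion long before the Geth process is at risk of halting.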
Preventive Measures and Future Actions
Increase the number of monitoring targets and shift to a configuration that allows countermeasures to be taken before block generation stops.
Identify the transaction types that contribute most to the growth in transaction volume and optimize them in coordination with the application side.