AMD has published information (PDF) about a bug in EPYC 7002 "Rome" server processors: a core can hang after about 1044 days of continuous operation. In practice, the server has to be rebooted roughly every 2.86 years to avoid the problem. AMD does not plan to fix this error.
The problem is that a core fails to exit the CC6 (Core C6 State) power-saving mode, which lowers the core's voltage and frequency when it is idle. AMD clarified that the exact time of the failure may depend on Spread Spectrum modulation and on the frequency of REFCLK, the reference clock that helps the chip keep track of time.
A plausible hypothesis about the cause of the error was put forward by Reddit user acid_migrain. According to this account, the error actually appears not after 1044 days but after 1042 days and 12 hours. The hypothesis assumes that the Time Stamp Counter (TSC) ticks at 2800 MHz. A simple calculation then gives 2800 × 10^6 ticks per second × 86,400 seconds per day × 1042.5 days, which is approximately equal to 0x380000000000000: there are "too many zeros for this not to be a coincidence." The problem has two simple workarounds: either reboot the server before it reaches 1044 days of uptime (based on AMD information), or disable the CC6 power-saving mode.
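The arithmetic behind this hypothesis can be checked with a short Python sketch. The 2800 MHz tick rate and the 0x380000000000000 threshold come from the Reddit post and are not confirmed by AMD; the function and constant names here are illustrative only.

```python
# Sanity check of the tick-count hypothesis (assumptions, not AMD-confirmed):
# the counter ticks at 2800 MHz and the failure coincides with
# the value 0x380000000000000.

TSC_HZ = 2800 * 10**6                 # assumed counter frequency, ticks per second
SECONDS_PER_DAY = 86_400
THRESHOLD_TICKS = 0x380000000000000   # equals 0x38 << 52


def ticks_after(days: float) -> int:
    """Number of counter ticks accumulated after `days` of uptime."""
    return round(TSC_HZ * SECONDS_PER_DAY * days)


def days_to_reach(ticks: int) -> float:
    """Days of uptime needed for the counter to reach `ticks`."""
    return ticks / (TSC_HZ * SECONDS_PER_DAY)


estimate = ticks_after(1042.5)
print(f"ticks after 1042.5 days: {estimate:#x}")
print(f"hypothesised threshold:  {THRESHOLD_TICKS:#x}")
print(f"days to reach threshold: {days_to_reach(THRESHOLD_TICKS):.4f}")
```

Under these assumptions the threshold is reached just short of 1042.5 days, which is consistent with the "1042 days and 12 hours" figure quoted above.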
The AMD EPYC Rome series processors were launched in 2019, so some of their owners may already have encountered this problem. The manufacturer added that it does not plan to fix the error, perhaps because a fix would be too costly or because too few customers are affected.