An error made during the development of a firmware update for Yandex Station smart speakers created an abnormally high load on NTP (Network Time Protocol) servers in the Russian segment of the Internet. These servers are used for time synchronization. The company did not discover the error immediately, but it has listed measures intended to prevent a recurrence.
In mid-October, a volunteer who ran an NTP server on his home router discovered that the device's connection was saturated with requests. Updating the firmware and rebooting did not solve the problem, but it disappeared once NTP was disabled. It later turned out that since mid-October, 120 of 140 Russian NTP servers had stopped working. The volunteer called on the Habr community to launch NTP servers, as a temporary measure, on low-cost virtual machines from domestic providers. Besides ordinary users, a large cloud operator responded and allocated 30 virtual machines at once.
The culprit turned out to be Yandex, which in mid-October began rolling out new firmware for its Station series of smart speakers. The firmware of these devices contains a standard time synchronization client. In normal operation, synchronization is performed every five hours, but if an attempt fails, it is repeated after five seconds. Due to an error in one of the client-related modules, all devices with the updated firmware began to synchronize time every five seconds, regardless of the result of the previous attempt. For scale, an estimated 3 million Yandex Stations were sold in the first nine months of 2024 alone.
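The difference between the intended schedule and the faulty one can be illustrated with a minimal sketch. This is not Yandex's code: the function names are hypothetical, and only the two intervals (five hours and five seconds) come from the description above.

```python
NORMAL_INTERVAL_S = 5 * 60 * 60  # five hours between syncs after a success
RETRY_INTERVAL_S = 5             # five seconds before retrying after a failure


def next_delay_intended(last_sync_ok: bool) -> int:
    """Intended behaviour: wait the long interval after a successful sync."""
    return NORMAL_INTERVAL_S if last_sync_ok else RETRY_INTERVAL_S


def next_delay_buggy(last_sync_ok: bool) -> int:
    """Behaviour as described for the faulty firmware: the result is ignored."""
    return RETRY_INTERVAL_S


def requests_per_day(next_delay, last_sync_ok: bool = True) -> float:
    """Requests one device sends per day if every attempt has the same outcome."""
    return 86_400 / next_delay(last_sync_ok)


if __name__ == "__main__":
    print(requests_per_day(next_delay_intended))  # 4.8
    print(requests_per_day(next_delay_buggy))     # 17280.0
```

Under these assumptions, a device whose synchronization always succeeds goes from about 5 requests per day to 17,280, a 3,600-fold increase per device.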
At the initial stage, Yandex deployed the firmware to 10% of devices, a standard measure for catching errors early. But the standard error detection scheme at that time did not include a metric for NTP requests, and by October 24 the firmware had spread to 100% of devices. The first complaints about an excessive number of NTP requests began to arrive on November 10. This symptom is usually explained by problems on the user side, and because there were few complaints, the issue was given low priority. The error was discovered only on November 20, at which point it was fixed and work began on a new firmware release.
But there was no longer time to wait, because by the weekend of November 23 and 24 only four servers were left online. As a temporary measure, Yandex therefore released a hotfix, an emergency update that increased the interval between requests from 5 to 600 seconds. The load on NTP servers was thereby reduced by a factor of 120, at a cost: if a Yandex Station could not synchronize the time on its first attempt after being turned on, its time-related functions were unavailable for the next 10 minutes. This helped stabilize the situation, and by that time members of the Habr community had begun launching NTP servers.
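The factor of 120 follows directly from the ratio of the two intervals. The sketch below works through the arithmetic. The device count is an assumption for illustration only: it reuses the 3 million sales estimate as if every one of those devices were online and running the faulty firmware, which the article does not claim.

```python
def aggregate_rate(devices: int, interval_s: float) -> float:
    """Requests per second from a fleet where each device asks every interval_s."""
    return devices / interval_s


DEVICES = 3_000_000  # assumption: illustrative fleet size, not a measured figure

before = aggregate_rate(DEVICES, 5)    # faulty firmware: one request every 5 s
after = aggregate_rate(DEVICES, 600)   # hotfix: one request every 600 s

if __name__ == "__main__":
    print(before)          # 600000.0 requests per second
    print(after)           # 5000.0 requests per second
    print(before / after)  # 120.0
```

The reduction factor does not depend on the assumed fleet size, since the device count cancels out of the ratio.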
To prevent a recurrence of the incident in the future, Yandex decided to take several measures:
- allocate several company resources to a common pool of NTP servers;
- organize a separate NTP server zone for its own devices;
- monitor NTP-related metrics when releasing new and updating old products;
- improve user feedback mechanisms to better identify such problems.
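As an illustration of the third measure, a staged rollout could compare the NTP request rate of the devices on the new firmware with that of the rest of the fleet before expanding further. The sketch below is purely hypothetical: the function name, the threshold, and the sample figures are assumptions, not a description of Yandex's monitoring.

```python
def ntp_rate_regressed(baseline_per_hour: float,
                       canary_per_hour: float,
                       max_ratio: float = 2.0) -> bool:
    """Return True if canary devices send far more NTP requests than baseline.

    Rates are average requests per device per hour. max_ratio is a
    hypothetical tolerance for normal variation between the two groups.
    """
    if baseline_per_hour <= 0:
        # Without a baseline, any canary traffic is treated as suspicious.
        return canary_per_hour > 0
    return canary_per_hour / baseline_per_hour > max_ratio


if __name__ == "__main__":
    baseline = 1 / 5      # one request every five hours
    canary = 3600 / 5     # one request every five seconds
    print(ntp_rate_regressed(baseline, canary))    # True: halt the rollout
    print(ntp_rate_regressed(baseline, baseline))  # False: safe to continue
```

With the intervals from this incident, the canary group would have exceeded the baseline by a factor of 3,600, so even a generous threshold would have flagged the release at the 10% stage.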