After routine maintenance at approximately 20:40 UTC, we detected an abnormal level of load on one of our DNS clusters. We are still investigating the cause of the issue while also working to prevent similar problems from recurring in the future. Approximately 10% of traffic was affected during a roughly 30-minute window before the situation returned to normal. We run 3 independent anycast networks, so we are currently checking why a small portion of the traffic was not handled correctly.
We want to apologize for the inconvenience to anyone affected.
[27th May 2021] Update:
We have determined the cause of the issue. During the maintenance, we deployed a routing update to the DNS system designed to increase performance and reliability. Due to an issue with a different system, the update failed to deploy correctly, which caused 2 out of 4 nameserver clusters to become unavailable. With capacity sharply reduced, the remaining two clusters came under unexpected load, and queries began to time out.
The timed-out queries prompted resolvers to retry, sending even more queries and gradually overloading one of the remaining nameserver clusters.
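To illustrate the feedback loop described above, here is a hypothetical sketch (not our actual infrastructure or numbers) of how resolver retries amplify load: every query that times out is resent, each retry can itself time out, and total volume grows geometrically with the timeout rate.

```python
def total_queries(base_queries: int, timeout_rate: float, max_retries: int) -> int:
    """Estimate total query volume when resolvers retry timed-out queries.

    Each round of retries can itself time out at the same rate, so the
    volume is roughly base * (1 + r + r^2 + ... + r^max_retries),
    where r is the fraction of queries that time out.
    All parameters here are illustrative assumptions.
    """
    total = 0.0
    attempt_volume = float(base_queries)
    for _ in range(max_retries + 1):
        total += attempt_volume
        attempt_volume *= timeout_rate  # only timed-out queries are retried
    return round(total)

# With a hypothetical 50% timeout rate and up to 3 retries,
# 1M baseline queries become ~1.88M queries hitting the servers.
print(total_queries(1_000_000, 0.5, 3))  # → 1875000
```

The key point is that the extra load lands on exactly the clusters that are already struggling, which is why timeouts tend to snowball rather than stay proportional to the original failure.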
We have taken multiple steps to improve quality control and monitoring to make sure this does not happen again. We have also added capacity to all clusters so that every single one is able to handle the full load on its own.
Unfortunately, this was human error, and we are working with the team to make sure it does not recur.