Google pins weekend outage on “unexercised” feature


Google has apologised for a Cloud outage last week that knocked offline many of its own customers, saying the fault was down to a new feature that wasn’t properly tested – and promising to do better at communicating such incidents in the future.

On Thursday last week, status trackers such as Downdetector began to suggest an issue, logging tens of thousands of outage reports for Google Cloud, Spotify, Discord, and more. After initial media reports, Google’s Cloud status page posted a short update about the outage.

The company has now released a full incident report, pinning the blame for the outage on a policy check system called Service Control, which was given a new feature last month allowing it to perform additional quota policy checks on API requests.

“This code change and binary release went through our region by region rollout, but the code path that failed was never exercised during this rollout due to needing a policy change that would trigger the code,” Google explained. “As a safety precaution, this code change came with a red-button to turn off that particular policy serving path.”

“The issue with this change was that it did not have appropriate error handling nor was it feature-flag protected,” it added. “Without the appropriate error handling, the null pointer caused the binary to crash. Feature flags are used to gradually enable the feature region by region per project, starting with internal projects, to enable us to catch issues. If this had been flag-protected, the issue would have been caught in staging.”
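Google's description of the failure combines two missing safeguards: error handling around the new code path, and a feature flag gating it. The sketch below illustrates both in Python, with invented names and a simplified policy structure; it is not Google's code, only an analogy for how a blank field can crash an unguarded check while a flagged, defensive version survives.

```python
# Hypothetical illustration (all names invented) of the two safeguards the
# incident report says were missing: error handling and a feature flag.

FEATURE_FLAGS = {"extra_quota_checks": False}  # enabled region by region

def check_quota_unsafe(policy: dict) -> bool:
    # No error handling: a policy whose "quota" field is blank raises here,
    # analogous to the null-pointer crash in the incident report.
    return policy["quota"]["limit"] > 0

def check_quota_safe(policy: dict) -> bool:
    if not FEATURE_FLAGS["extra_quota_checks"]:
        return True  # new code path never runs until the flag is turned on
    quota = policy.get("quota")
    if quota is None or "limit" not in quota:
        # Handle the blank field (log, fall back) instead of crashing.
        return True
    return quota["limit"] > 0

bad_policy = {"name": "example", "quota": None}  # unintended blank field
print(check_quota_safe(bad_policy))  # True: handled, no crash
```

Had the real code path been flag-protected, the blank-field crash would have surfaced first in internal or staging projects, as Google notes.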

The policy change was added to Service Control on the day of the incident. “This policy data contained unintended blank fields,” Google said. “Service Control then regionally exercised quota checks on policies in each regional datastore. This pulled in blank fields for this respective policy change and exercised the code path that hit the null pointer causing the binaries to go into a crash loop.”

Wide impact

The outage hit a wide range of third-party companies, including OpenAI, Shopify, and MailChimp, as well as Cloudflare, which in turn impacted the customers of that web and cloud services provider.

Cloudflare said the Google troubles didn’t touch its core services, but did cause outages in a large set of critical services including Workers KV, Gateway, Stream, and its Dashboard.

“We’re deeply sorry for this outage: this was a failure on our part, and while the proximate cause (or trigger) for this outage was a third-party vendor failure, we are ultimately responsible for our chosen dependencies and how we choose to architect around them,” Cloudflare said in a blog post.

“This was not the result of an attack or other security event,” the post added. “No data was lost as a result of this incident. Cloudflare Magic Transit and Magic WAN, DNS, Cache, proxy, WAF and related services were not directly impacted by this incident.”

Seven-hour outage explained

Google said the incident was spotted immediately: engineers were on the case within two minutes, the cause was identified within ten minutes, and a fix was rolled out after 40 minutes. However, the knock-on effects of the incident lasted several hours, starting just before 11am US Pacific time on Thursday and ending after 6pm.

As the crashed systems restarted, they created a “herd effect on the underlying infrastructure it depends on”, overloading that infrastructure, Google said. “Service Control did not have the appropriate randomized exponential backoff implemented to avoid this.”
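Randomized exponential backoff, which Google says Service Control lacked, spreads retries out over time so that a fleet of restarting tasks does not hit the backend in lockstep. A minimal Python sketch, with illustrative names and parameters:

```python
import random
import time

def backoff_delays(max_retries: int = 5, base: float = 0.5, cap: float = 32.0):
    """Yield randomized ("full jitter") exponential backoff delays.

    Delays grow exponentially with each attempt, are capped, and are drawn
    uniformly from [0, limit] so concurrent clients desynchronize.
    """
    for attempt in range(max_retries):
        yield random.uniform(0, min(cap, base * 2 ** attempt))

def call_with_backoff(op, max_retries: int = 5):
    """Retry op() with randomized exponential backoff between attempts."""
    for delay in backoff_delays(max_retries):
        try:
            return op()
        except ConnectionError:
            time.sleep(delay)  # back off before retrying
    return op()  # final attempt; let any exception propagate
```

Without the jitter, every crashed task would retry on the same schedule, reproducing exactly the thundering-herd overload Google describes.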

Resolving that took another two hours and 40 minutes, with Google throttling the fix to avoid overloading the infrastructure and rerouting traffic elsewhere to reduce the load. Recovery took longer still for some Google and Cloud products, depending on their architecture, the company said.
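Throttling a recovery rollout typically means rate-limiting how fast restarted tasks are allowed back onto the shared infrastructure. A token-bucket limiter is one common way to do this; the sketch below is illustrative only, not Google's mechanism:

```python
import time

class Throttle:
    """Minimal token-bucket rate limiter (illustrative, not Google's code).

    Allows up to `burst` operations immediately, then refills permission at
    `rate` tokens per second, smoothing load on the backend during recovery.
    """
    def __init__(self, rate: float, burst: int):
        self.rate = rate
        self.capacity = burst
        self.tokens = float(burst)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill tokens for the time elapsed since the last check.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

Restart logic would call `allow()` before bringing each task back, pacing the recovery instead of letting every instance return at once.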

Going forward

Google acknowledged the harm the incident caused its users and said it would take steps to avoid a similar outage.

“We deeply apologize for the impact this outage has had. Google Cloud customers and their users trust their businesses to Google, and we will do better,” Google said in a statement. “We apologize for the impact this has had not only on our customers’ businesses and their users but also on the trust of our systems. We are committed to making improvements to help avoid outages like this moving forward.”

Those changes include rejigging the architecture of Service Control so it’s modular, meaning any failures won’t take out entire systems, as well as improving its communications to help customers react more quickly – which may be a response to the fact that customers spotted the outage and tweeted about it well before Google made any public announcement.
