Microsoft has had significant difficulties recovering from its most severe Azure outage in years. On September 4, 2018, there was a weather-related power spike at Microsoft’s Azure South Central U.S. region in San Antonio. That surge crippled the datacenter’s HVAC system. The subsequent rising temperatures triggered automatic hardware shutdowns. More than 30 cloud services, as well as the Azure status page, were taken out in the process.
Microsoft restored its HVAC later in the day and began the process of bringing the datacenter back up. Two days later, though, some services were still not fully operational. This incident should serve as a reminder that IT folks must take their cloud infrastructure design and disaster recovery into their own hands. This is not a Microsoft-only problem: one of the availability zones in Amazon’s US-East-1 region suffered an hour-and-a-half outage on May 31, 2018, when part of the datacenter lost power.
Going back to the Microsoft outage, Microsoft said that parts of the extended outage were caused by both internal Azure systems and customers’ attempts to redeploy services manually. Extended public cloud disruptions are rare, but they do happen.
2018 Cloud Outages
So far in 2018, Microsoft has had cooling and power issues in Europe, AWS had an S3 outage and the power issue above, and Google has had occasional hiccups. It’s been nearly three years since an Azure incident of this magnitude. That long quiet stretch has made some organizations complacent about resilient infrastructure design and disaster recovery planning.
Disaster Recovery Planning
The critical first step is to understand your cloud provider’s infrastructure. Most Azure regions consist of a single data center, while AWS and Google deploy multiple availability zones in a region. For AWS and Google, therefore, a hit to one data center won’t wipe out the whole region.
This means that Microsoft customers looking for redundancy will need to put resources in two or more regions. AWS and Google customers can stay in a single region but choose different availability zones. The multi-region requirement becomes an even bigger issue when dealing with compliance and privacy laws in countries like Germany, where it may be illegal for data to reside outside the country.
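To make the placement reasoning above concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions, not any provider's API: the Region class, the plan_redundancy function, and every region, zone, and country value are hypothetical. The idea is to prefer two availability zones inside one region when the region has them, fall back to two separate regions when each region is a single data center, and refuse to plan at all if a data residency rule leaves no redundant option.

```python
from dataclasses import dataclass


# Hypothetical model: names and fields are illustrative, not a provider API.
@dataclass(frozen=True)
class Region:
    name: str
    country: str
    zones: tuple  # an empty tuple models a single-datacenter region


def plan_redundancy(regions, allowed_countries=None):
    """Return two placements that do not share a data center.

    Prefers two zones inside one region; otherwise spreads across two regions.
    Raises ValueError when the residency constraint leaves no redundant option.
    """
    # Keep only regions that satisfy the data residency constraint, if any.
    candidates = [
        r for r in regions
        if allowed_countries is None or r.country in allowed_countries
    ]

    # Case 1: a region with multiple availability zones can host both copies.
    for region in candidates:
        if len(region.zones) >= 2:
            return [(region.name, region.zones[0]), (region.name, region.zones[1])]

    # Case 2: single-datacenter regions, so the copies go to two regions.
    if len(candidates) >= 2:
        return [(candidates[0].name, None), (candidates[1].name, None)]

    raise ValueError("no redundant placement satisfies the residency constraint")
```

With a catalog of single-datacenter regions where only one region sits in the permitted country, the function raises an error instead of quietly returning a non-redundant plan. That is the compliance trap described above: a residency rule can rule out the second region you need.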
The best solution, though, is to look at a multi-cloud strategy. This hedges against complete provider failure. It also allows customers to choose the provider with the best solution to their specific challenge.
Author Bio: Joe Goldberg is the Senior Cloud Program Manager at CCSI. Over the past 15+ years, Joe has helped companies design, build out, and optimize their network and data center infrastructure. As a result of his efforts, major gains in ROI have been realized through virtualization, WAN implementation, core network redesigns, and the adoption of cloud services. Joe is also ITIL certified.