Microsoft's Windows Azure experienced a service disruption in several regions around the world that began at 5:45 PM PST on February 28 and lasted until 2:57 AM PST on February 29. The outage left several customers without access to their cloud applications. Microsoft has attributed the outage to a bug related to Leap Day.
Bill Laing, Microsoft's corporate vice president of the Server and Cloud Division, issued a statement explaining the outage and Microsoft's efforts to resolve the problem. Here's an excerpt from Laing's statement:
Yesterday, February 28th, 2012 at 5:45 PM PST Windows Azure operations became aware of an issue impacting the compute service in a number of regions. The issue was quickly triaged and it was determined to be caused by a software bug. While final root cause analysis is in progress, this issue appears to be due to a time calculation that was incorrect for the leap year. Once we discovered the issue we immediately took steps to protect customer services that were already up and running, and began creating a fix for the issue. The fix was successfully deployed to most of the Windows Azure sub-regions and we restored Windows Azure service availability to the majority of our customers and services by 2:57AM PST, Feb 29th.
"We sincerely apologize for any inconvenience this has caused," Laing said. Despite Laing's apology, the consequences of this service disruption could be severe for Microsoft. By the timeline in Laing's statement, the outage lasted more than nine hours, and Microsoft made no official statement until Laing's. In addition, many organizations are hesitant to move to the cloud because of concerns about security and outages. Unfortunately for Microsoft, this incident reinforces the idea that instability is a risk that sometimes accompanies the cloud.
Finally, this situation suggests that Microsoft didn't adequately test how Windows Azure handles Leap Day. Although it might be easy to overlook this case in a complex system such as Windows Azure, February 29 is a well-known edge case in date handling, and Azure developers should have taken care to test it.
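To illustrate the kind of test that catches a Leap Day problem, here is a minimal Python sketch. It is not Microsoft's code, and the root cause was still under analysis at the time of Laing's statement. The helper names `naive_one_year_later` and `safe_one_year_later` are hypothetical, and the choice to fall back to February 28 is an assumption made only for this example.

```python
from datetime import date


def naive_one_year_later(d: date) -> date:
    # Hypothetical buggy helper: bumps the year and keeps month and day as-is.
    # For February 29 this asks for a date that does not exist the next year.
    return d.replace(year=d.year + 1)


def safe_one_year_later(d: date) -> date:
    # Hypothetical fix: when the same month/day does not exist next year
    # (only possible for February 29), fall back to February 28.
    try:
        return d.replace(year=d.year + 1)
    except ValueError:
        return d.replace(year=d.year + 1, day=28)


def naive_fails_on(d: date) -> bool:
    # Returns True when the naive calculation cannot produce a valid date.
    try:
        naive_one_year_later(d)
    except ValueError:
        return True
    return False


if __name__ == "__main__":
    leap_day = date(2012, 2, 29)
    ordinary_day = date(2012, 2, 28)

    # The naive version works on ordinary days but breaks on Leap Day.
    assert not naive_fails_on(ordinary_day)
    assert naive_fails_on(leap_day)

    # The safe version handles both.
    assert safe_one_year_later(ordinary_day) == date(2013, 2, 28)
    assert safe_one_year_later(leap_day) == date(2013, 2, 28)
    print("Leap Day checks passed")
```

A test suite that feeds February 29 into every date calculation, as the checks at the bottom do, would surface this class of bug long before the calendar does.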