Cloud Zone is brought to you in partnership with:

I'm a Windows Azure MVP, the principal developer for OakLeaf Systems and the author of 30+ books about Microsoft software. The books have more than 1.25 million English copies in print and have been translated into 20+ languages. Forbes Magazine ranked me seventh in their "Who Are The Top 20 Influencers in Big Data?" article of 2/3/2012. Roger has posted 37 posts at DZone. You can read more from them at their website. View Full User Profile

Windows Azure Outages in South Central US Data Center

04.20.2012
| 5787 views |
  • submit to reddit

My OakLeaf Systems Azure Table Services Sample Project (Tools v1.4 with Azure Storage Analytics) demo application, which runs on two Windows Azure compute instances in Microsoft’s South Central US (San Antonio) data center incurred an extraordinary number of compute outages on 4/19/2012. Following is the Pingdom report for that date:

imageimage

image

The Mon.itor.us monitoring service showed similar downtime.

This application, which I believe is the longest (almost) continuously running Azure application in existence, usually runs within Microsoft’s Compute Service Level Agreement for Windows Azure: “Internet facing roles will have external connectivity at least 99.95% of the time.”

The following table from my Uptime Report for my Live OakLeaf Systems Azure Table Services Sample Project: March 2012 post of 4/3/2012 lists outage and response time from Pingdom for the last 10 months:

image

The Windows Azure Service Dashboard reported Service Management Degradation in the Status History but not problems with existing hosted services:

image

[RESOLVED] Partial Service Management Degradation

19-Apr-12
11:11 PM UTC We are experiencing a partial service management degradation in the South Central US sub region. At this time some customers may experience failed service management operations in this sub region. Existing hosted services are not affected and deployed applications will continue to run. Storage accounts in this sub region are not affected either. We are actively investigating this issue and working to resolve it as soon as possible. Further updates will be published to keep you apprised of the situation. We apologize for any inconvenience this causes our customers.

20-Apr-12
12:11 AM UTC We are still troubleshooting this issue and capturing all the data that will allow us to resolve it. Further updates will be published to keep you apprised of the situation. We apologize for any inconvenience this causes our customers. Further updates will be published to keep you apprised of the situation. We apologize for any inconvenience this causes our customers.

1:11 AM UTC The incident has been mitigated for new hosted services that will be deployed in the South Central US sub region. Customers with hosted services already deployed in this sub region may still experience service management operations failures or timeouts. Further updates will be published to keep you apprised of the situation. We apologize for any inconvenience this causes our customers.

1:47 AM UTC The repair steps have been executed and successfully validated. Full service management functionality has been restored in the affected cluster in the South Central US sub-region. We apologize for any inconvenience this has caused our customers.

However, Windows Azure Worldwide Service Management did not report problems:

image

I have requested a Root Cause Analysis from the Operations Team and will update this post when I receive a reply.

Published at DZone with permission of its author, Roger Jennings. (source)

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)