Fault Domains in Azure IaaSv2

With the availability of IaaSv2 in Microsoft Azure, several new features change the way resources are deployed and maintained. One significant change is that IaaSv2 virtual machines get three fault domains, whereas IaaSv1 virtual machines get only two. In Azure, a fault domain is essentially a rack of servers: a power failure at the rack level affects every server in that rack, and therefore in that fault domain.

To make sure your application can survive a fault domain failure, you need to spread its components, for instance front-end web servers, across fault domains. In Azure you do this by assigning virtual machines to an availability set. At deployment time, and again during service healing, Azure’s fabric controller automatically spreads the virtual machines that belong to the same availability set across the fault domains. As an administrator, you cannot control this assignment.

If you deploy virtual machines in cloud services (IaaSv1 style), the maximum number of fault domains is two, which can present a problem. For instance, when you deploy a majority node set cluster with three nodes across two fault domains, one fault domain necessarily hosts two of the three nodes, and it is entirely possible that this is the one that fails. When that happens, the surviving node no longer has a majority and goes offline as well. For such deployments, three fault domains are required to survive the failure of any one fault domain.
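The majority arithmetic can be checked with a small Python sketch. It assumes a simple round-robin placement of nodes across fault domains, which is only an illustration and not a description of how the fabric controller actually places virtual machines; the function names `spread` and `survives_any_single_failure` are made up for this example.

```python
from collections import Counter


def spread(node_count, fault_domain_count):
    """Assumed round-robin placement: node i lands in fault domain i % count."""
    return [i % fault_domain_count for i in range(node_count)]


def survives_any_single_failure(placement):
    """True if a strict majority of nodes remains after losing any one fault domain."""
    total = len(placement)
    nodes_per_domain = Counter(placement)
    # After losing a domain holding `lost` nodes, (total - lost) nodes remain;
    # the cluster keeps quorum only if that is more than half of all nodes.
    return all((total - lost) * 2 > total for lost in nodes_per_domain.values())


two_domains = spread(3, 2)    # one domain ends up with two of the three nodes
three_domains = spread(3, 3)  # one node per domain

print(two_domains, survives_any_single_failure(two_domains))
print(three_domains, survives_any_single_failure(three_domains))
```

With two fault domains the check fails, because losing the domain that holds two nodes leaves one node out of three. With three fault domains, any single failure still leaves two of three nodes running.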

Now that you understand what a fault domain is and why some deployments need three of them, how do you get three fault domains in Azure? You need to deploy your virtual machines using the IaaSv2 model. This model is based on Azure Resource Manager, which also enables rich template-based deployment of virtual machines, network interfaces, IP addresses, load balancers, web sites and more. Many Microsoft and community templates can be found at http://azure.microsoft.com/en-us/documentation/templates/
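As a rough illustration of what the availability set part of such a template might look like, the Python sketch below builds the resource as a dictionary and prints it as JSON. The resource type `Microsoft.Compute/availabilitySets` and the `platformFaultDomainCount` and `platformUpdateDomainCount` properties come from the Resource Manager schema; the `apiVersion` value, the name `web-avset`, the update domain count of 5 and the helper function itself are assumptions chosen for the example, so check them against the template you actually deploy.

```python
import json


def availability_set_resource(name, fault_domains=3, update_domains=5):
    """Build a hypothetical availability set resource for a Resource Manager template."""
    return {
        "type": "Microsoft.Compute/availabilitySets",
        "apiVersion": "2015-06-15",  # assumption: adjust to the version you target
        "name": name,
        "location": "[resourceGroup().location]",
        "properties": {
            # Three fault domains, as discussed above
            "platformFaultDomainCount": fault_domains,
            # Assumed value for illustration only
            "platformUpdateDomainCount": update_domains,
        },
    }


# "web-avset" is a made-up name for this example
print(json.dumps(availability_set_resource("web-avset"), indent=2))
```

The printed JSON would go into the `resources` array of a template, and each virtual machine would then reference the availability set so that it is placed in one of the three fault domains.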

To get a feel for how such a deployment works and to check if your resources are spread across three fault domains, take a look at our Cloud Chat video:
