The Cloud Way: Understanding failover in Azure

Understanding failover in Azure cloud services:

The Service Level Agreement (SLA) for Azure cloud services is 99.95%, i.e. the compute roles hosted on Azure will have external connectivity at least 99.95% of the time. However, for this SLA to apply, Windows Azure requires its users to run at least two instances of each web/worker role.

This article examines that requirement and gives an insight into the architecture of compute roles on Windows Azure.
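As a rough illustration of what 99.95% means in practice, the sketch below converts an SLA percentage into the downtime it still permits. The function name and the 30-day month are assumptions made for this example; they are not taken from the Azure SLA text.

```python
def allowed_downtime_minutes(sla_percent, period_days=30):
    """Return the minutes of downtime an SLA percentage still permits.

    period_days=30 is an assumed billing month, used only for illustration.
    """
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - sla_percent / 100)


# Downtime budget for a 99.95% SLA over an assumed 30-day month.
print(round(allowed_downtime_minutes(99.95), 1))  # prints 21.6
```

In other words, even a service that meets the SLA may be unreachable for roughly 21 to 22 minutes in such a month.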

First let’s get familiar with the terms used in this article:

(i) Azure Fabric Controller is responsible for provisioning and monitoring the condition of the Azure compute instances. The Fabric Controller checks the status of the hardware and software of the host and guest machine instances. When it detects a failure, it enforces SLAs by automatically relocating the VM instances.

The Fabric Controller uses dedicated resources that are separate from Azure hosted applications. It is designed to be highly available, because it serves as the nucleus of the Azure system. It monitors and manages role instances across fault domains.

The Azure Fabric Controller operates as the kernel and framework for Windows Azure, as it manages all the nodes, including servers, load balancers, switches and routers.

(ii) Fault Domain is a physical unit of failure, and is closely related to the physical infrastructure in the data centers. The scope of a single point of failure is referred to as a fault domain. In Windows Azure, a rack of servers in a datacenter can be considered a fault domain.

The Windows Azure Fabric is responsible for deploying the instances of your application in different fault domains. Right now the fabric controller makes sure that your application uses at least 2 (two) fault domains; depending on capacity and VM availability, it may be spread across more than that. As of now, developers have no direct control over how many fault domains their application will use.

(iii) Upgrade Domain defines the logical unit of deployment of an application. This concept determines how the different instances of a compute role are upgraded, and it keeps the service highly available while an application is being upgraded. To achieve this, while the instances in one upgrade domain are being upgraded, the instances in the other upgrade domains keep running. When the upgrade of all the instances in this domain is complete, the same process is repeated for the instances in the next upgrade domain, and so on until all instances of the application are upgraded.

By default an application has 5 upgrade domains, but a user can change this value to a maximum of 20 upgrade domains. Also, the upgrade domains are always spread across fault domains, and instances of a web/worker role are allocated to upgrade domains in a circular manner.
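The circular allocation described above can be sketched as a simple round-robin. This is only a model of the idea, not the fabric controller's actual algorithm; the function name and the instance labels are hypothetical.

```python
def assign_upgrade_domains(instance_count, upgrade_domain_count=5):
    """Map each role instance to an upgrade domain in a circular manner.

    Hypothetical model: instance i goes to upgrade domain i mod N.
    The default of 5 and the maximum of 20 come from the text above.
    """
    if not 1 <= upgrade_domain_count <= 20:
        raise ValueError("upgrade_domain_count must be between 1 and 20")
    return {
        f"instance_{i}": i % upgrade_domain_count
        for i in range(instance_count)
    }


# Seven instances over the default five upgrade domains:
# the sixth and seventh instances wrap around to domains 0 and 1.
for name, domain in assign_upgrade_domains(7).items():
    print(name, "-> upgrade domain", domain)
```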

Introduction:

The compute role (web/worker) instances in Windows Azure are stateless. This means that in case of a fault, or during an upgrade, an instance of a compute role might stop on one physical server, and when this happens another server will pick it up. This is done by the fabric controller, which Windows Azure uses to manage its systems.

However, the storage in Azure does maintain state. In fact, the data in Azure storage is replicated at three places within a single datacenter. So if the code of the compute role application is written in a stateless manner, i.e. the state is saved using Azure storage services, then when an instance fails the fabric controller will start another instance at some other location, and that instance will simply restart the interrupted transaction and keep working.
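A minimal sketch of this stateless pattern follows. A plain dictionary stands in for Azure storage, and `run_instance` and `crash_after` are hypothetical names used only to simulate an instance failing part of the way through its work.

```python
def run_instance(items, store, crash_after=None):
    """Process items, keeping all progress in `store` rather than in the instance.

    `store` is a stand-in for durable storage (an assumption for this sketch).
    `crash_after` simulates the instance failing after that many items.
    Returns True if all work finished, False if the instance "failed".
    """
    processed = 0
    while store["next_item"] < len(items):
        if crash_after is not None and processed == crash_after:
            return False  # simulated failure; nothing is lost from the store
        index = store["next_item"]
        store["results"].append(items[index] * 2)  # the actual "work"
        store["next_item"] = index + 1             # checkpoint progress
        processed += 1
    return True


store = {"next_item": 0, "results": []}
items = [1, 2, 3, 4]

# First instance fails after two items; a replacement resumes from the store.
run_instance(items, store, crash_after=2)
run_instance(items, store)
print(store["results"])  # prints [2, 4, 6, 8]
```

Because the instance itself holds no state, the replacement needs nothing from the failed one. A real service would also have to make the work and the checkpoint atomic or idempotent, which this sketch does not attempt.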

Case: Only one instance of the compute role is deployed on Azure

In this case the instance of the compute role is present in only one fault domain and one upgrade domain. If that fault domain fails, the service goes down. When the fabric controller detects the failure it will create a replacement instance at some other location, but until that happens our service will be unavailable.

The same applies when we update the application and redeploy it, because during an upgrade all instances of the application within an upgrade domain are stopped, upgraded and then restarted. Since here we have only one upgrade domain, containing the only instance of our application, the application will be down for as long as the upgrade takes.

This is why the SLA for cloud services guarantees 99.95% availability only when two or more instances of the application are deployed.

Case: Two or more instances of the compute role are deployed on Azure

When two or more instances of the compute role are deployed, the fabric controller ensures that these instances are running in two or more fault domains. Whenever an instance goes down there will be at least one other instance still running, and when the fabric controller notices that one of the instances of the compute role has gone down, it starts another instance on some other physical server. In this way the failure of one instance does not bring down the whole application.

During an upgrade of the application, only the instances of one upgrade domain are stopped and upgraded at a time. During this period the other instances keep running, so the application does not go through any downtime.
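To make the difference between the two cases concrete, the sketch below counts how many instances survive the loss of one fault domain. It assumes a round-robin placement over two fault domains; the real placement is decided by the fabric controller, and the function name is hypothetical.

```python
def surviving_instances(instance_count, failed_fault_domain, fault_domain_count=2):
    """Count the instances still running after one fault domain fails.

    Assumes instance i is placed in fault domain i mod N (an illustration only).
    """
    placement = [i % fault_domain_count for i in range(instance_count)]
    return sum(1 for fd in placement if fd != failed_fault_domain)


print(surviving_instances(1, failed_fault_domain=0))  # prints 0
print(surviving_instances(2, failed_fault_domain=0))  # prints 1
```

With a single instance the failure of its fault domain leaves nothing running, while with two instances one always remains to serve requests.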
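The upgrade walk can be modelled in the same spirit. The sketch below takes one upgrade domain offline at a time and reports the smallest number of instances left running at any step. The round-robin allocation and the function name are assumptions for illustration.

```python
def min_running_during_upgrade(instance_count, upgrade_domain_count=5):
    """Return the fewest instances running at any point of a rolling upgrade.

    Assumes instance i belongs to upgrade domain i mod N, and that exactly
    one upgrade domain is taken offline at a time.
    """
    domains = [i % upgrade_domain_count for i in range(instance_count)]
    running_per_step = [
        sum(1 for d in domains if d != offline)
        for offline in sorted(set(domains))
    ]
    return min(running_per_step)


print(min_running_during_upgrade(1))  # prints 0
print(min_running_during_upgrade(2))  # prints 1
print(min_running_during_upgrade(7))  # prints 5
```

A result of 0 corresponds to the single-instance case, where the application is down while its only upgrade domain is being upgraded.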


Monday, 4 May 2015
