Understanding failover in Azure Cloud Services:
The Service Level Agreement (SLA) of Azure for Cloud Services is 99.95%, i.e. the compute roles hosted on Azure will have external connectivity at least 99.95% of the time. However, for this SLA figure to apply, Windows Azure requires the user to maintain at least two instances of each worker/web role.
This article tries to explain this requirement and gives an insight into the architecture of compute roles on Windows Azure.
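
To put the 99.95% figure in perspective, the short sketch below converts an availability percentage into a downtime budget. The function name and the 30-day month are assumptions made for illustration only; the actual SLA document defines its own measurement period.

```python
# Downtime budget implied by an availability percentage.
# Assumption: a 30-day period is used purely for illustration.

def allowed_downtime_minutes(availability_percent: float, days: int = 30) -> float:
    """Return the minutes of downtime permitted over `days` days."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - availability_percent / 100)


if __name__ == "__main__":
    # 99.95% over 30 days leaves roughly 21.6 minutes of downtime.
    print(round(allowed_downtime_minutes(99.95), 1))
```
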
First, let's get familiar with the terms used in this article:
(i) Azure Fabric Controller is responsible for provisioning and monitoring the condition of the Azure compute instances. It checks the status of the hardware and software of the host and guest machine instances. When it detects a failure, it upholds the SLA by automatically relocating the affected VM instances.
The Fabric Controller uses dedicated resources that are separate from Azure hosted applications. It is designed to be highly available because it serves as the nucleus of the Azure system. It monitors and manages role instances across fault domains.
The Azure Fabric Controller operates as the kernel and framework for Windows Azure, as it manages all the nodes, which include servers, load balancers, switches, routers, etc.
(ii) Fault Domain is a physical unit of failure and is closely related to the physical infrastructure in the datacenters. The scope of a single point of failure is referred to as a fault domain. In Windows Azure, a rack of servers in a datacenter can be considered a fault domain.
The Windows Azure Fabric is responsible for deploying the instances of your application in different fault domains. Right now the fabric controller makes sure that your application uses at least two fault domains; depending on capacity and VM availability, the instances may be spread across more than that. As of now, developers have no direct control over how many fault domains their application will use.

(iii) Upgrade Domain defines the logical unit of deployment of an application. This concept lets Microsoft Azure control how the different instances of a compute role are upgraded, so that the service stays highly available while the application is being upgraded. To achieve this, while the instances in one upgrade domain are being upgraded, the instances in the other upgrade domains keep running. When the upgrade of all instances in that domain is complete, the same process is repeated for the instances in the next upgrade domain, and so on until all instances of the application are upgraded.
By default an application has 5 upgrade domains, but a user can change this value to a maximum of 20 upgrade domains. Also, the upgrade domains are always spread across fault domains, and instances of a web/worker role are allocated to upgrade domains in a circular manner.
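
The circular allocation described above can be pictured with a small sketch. This is only an illustration of round-robin assignment under the stated default of 5 and maximum of 20 upgrade domains; the function name and the integer instance indices are hypothetical, and the real placement is decided by the fabric controller.

```python
from collections import defaultdict


def assign_upgrade_domains(instance_count: int, upgrade_domain_count: int = 5) -> dict:
    """Assign instance indices to upgrade domains in a circular (round-robin) manner."""
    if not 1 <= upgrade_domain_count <= 20:
        raise ValueError("upgrade_domain_count must be between 1 and 20")
    domains = defaultdict(list)
    for instance in range(instance_count):
        # Instance 0 goes to domain 0, instance 1 to domain 1, and so on,
        # wrapping around once the last domain is reached.
        domains[instance % upgrade_domain_count].append(instance)
    return dict(domains)


if __name__ == "__main__":
    # Seven instances over three upgrade domains.
    print(assign_upgrade_domains(7, 3))
```

With seven instances and three upgrade domains, the first domain receives instances 0, 3 and 6, the second 1 and 4, and the third 2 and 5.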
Introduction:
The compute role (web/worker) instances in Windows Azure are stateless. This means that in case of a fault, or during an upgrade, a compute role's instance might stop on one physical server and be picked up by another server. This is done through the fabric controller, which Windows Azure uses to manage the systems.
However, the storage in Azure does maintain state. In fact, the data in our Azure storage is replicated at three places in a single datacenter. So if the code of the compute role application is written in a stateless manner, i.e. the state is saved using Azure storage services, then when an instance fails the fabric controller will start another instance at some other location, and that instance will just restart the interrupted transaction and keep working.
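
As a rough illustration of this pattern, the sketch below keeps all progress in an external store so that a replacement instance can resume where the failed one stopped. The `durable_store` dictionary is a hypothetical stand-in for Azure storage, and `process_items`, `job_id` and `fail_at` are names invented for this example, not part of any Azure API.

```python
# Hypothetical stand-in for a durable store such as Azure storage.
durable_store = {}


def process_items(items, job_id, fail_at=None):
    """Process items in order, checkpointing progress after each one.

    `fail_at` simulates the instance crashing just before that index.
    Returns the number of items completed so far for this job.
    """
    start = durable_store.get(job_id, 0)  # resume from the last checkpoint
    for index in range(start, len(items)):
        if fail_at is not None and index == fail_at:
            raise RuntimeError("simulated instance failure")
        # ... the real work on items[index] would happen here ...
        durable_store[job_id] = index + 1  # state lives outside the instance
    return durable_store.get(job_id, 0)


if __name__ == "__main__":
    work = ["a", "b", "c", "d", "e"]
    try:
        process_items(work, "job-1", fail_at=3)  # first instance fails
    except RuntimeError:
        pass
    print(process_items(work, "job-1"))  # replacement instance resumes
```

Because the instance itself holds no state, the second call continues from the fourth item instead of starting over.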
Case: Only one instance of the compute role is deployed on Azure
In this case the instance of the compute role is present in only one fault domain and one upgrade domain. If that fault domain fails, the service goes down. When the fabric controller detects the failure it creates a new instance at some other location, but until that happens our service remains down.
The same applies when we update the application and redeploy it, because during an upgrade all instances of the application within an upgrade domain are stopped, upgraded and then restarted. Here we have only one upgrade domain containing the only instance of our application, so the application is down for as long as the upgrade takes.
This is the reason why the 99.95% SLA for cloud services applies only when two or more instances of the application are deployed.
Case: Two or more instances of the compute role are deployed on Azure
When two or more instances of the compute role are deployed, the fabric controller ensures that these instances run in two or more fault domains. Whenever an instance goes down there is at least one other instance still running, and when the fabric controller notices that one of the instances of the compute role has gone down, it starts another instance on some other physical server. In this manner the failure of one instance does not bring down the whole application.
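
The difference between the two cases can be sketched as a small simulation of one fault domain failing. The round-robin placement and the name `surviving_instances` are assumptions made for this illustration; the actual placement across fault domains is chosen by the fabric controller.

```python
def surviving_instances(instance_count, fault_domain_count=2, failed_domain=0):
    """Return the indices of instances still running after one fault domain fails.

    Assumption: instances are spread over fault domains round-robin.
    """
    return [
        instance
        for instance in range(instance_count)
        if instance % fault_domain_count != failed_domain
    ]


if __name__ == "__main__":
    print(surviving_instances(1))  # single instance: nothing survives
    print(surviving_instances(2))  # two instances: one keeps serving
```

With a single instance the list of survivors is empty, so the service is down until a replacement starts. With two instances, one survivor remains while the fabric controller restores the other.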
During an upgrade of the application, only the instances in one upgrade domain are stopped and upgraded at a time. During this period the other instances keep running, so the application does not go through any downtime.
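
A minimal sketch of this rolling upgrade is given below. It walks the upgrade domains one at a time and reports the lowest number of instances left running at any step. The function name is hypothetical, and the round-robin assignment of instances to upgrade domains follows the circular allocation described earlier.

```python
def min_running_during_upgrade(instance_count, upgrade_domain_count=5):
    """Return the fewest instances running at any point of a rolling upgrade."""
    # Allocate instances to upgrade domains in a circular manner.
    domains = [[] for _ in range(upgrade_domain_count)]
    for instance in range(instance_count):
        domains[instance % upgrade_domain_count].append(instance)

    lowest = instance_count
    for domain in domains:
        if not domain:
            continue  # an empty upgrade domain takes nothing offline
        # Only the instances of the current domain are stopped.
        running = instance_count - len(domain)
        lowest = min(lowest, running)
    return lowest


if __name__ == "__main__":
    print(min_running_during_upgrade(1))  # 0: the only instance is stopped
    print(min_running_during_upgrade(2))  # 1: one instance always serves
```

A result of zero means the application is unreachable at some step of the upgrade, which is what happens with a single instance.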
References: