In this article, we’ll discuss a few simple approaches, both proactive and reactive, to keeping your Azure Cloud Roles stable. CloudMonix is a tool that helps with these approaches. Since unstable instances often need to be rebooted, we’ll also discuss how to ensure that reboots do not cause issues or outages.


Breakdown

  • Proactive stability – rebooting instances on a daily basis
  • Reactive stability – rebooting misbehaving instances when issues occur
  • Handling server restarts – ensuring that reboots do not cause outages

Proactive stability – DAILY REBOOTS

Probably the simplest and most effective way to keep your applications stable is to reboot your cloud role instances on a regular basis. No matter how few memory leaks your application has, the longer a server runs without a reboot, the more issues and performance degradation creep in. Memory and disk fragmentation, poorly closed connections, obsolete data in a cache, large temporary folders, and of course memory leaks can all cause your application’s performance to degrade slowly over time. Unfortunately, very few organizations reboot their Cloud Role instances on a regular basis, largely because it is not trivial to automate.

In CloudMonix, we’ve devised a way to reboot instances one at a time, without impacting the overall stability of the Role itself. The basic idea is to reboot every instance once per day: at the beginning of each clock hour, the instance whose index matches that hour is rebooted.

The CloudMonix expression that drives this Action is a single line of C#-like code:

CheckTimeUtc.Hour == (InstanceIndex % 24)

CheckTimeUtc is a special CloudMonix variable that represents the current time in UTC. InstanceIndex is another special variable that CloudMonix tracks when it evaluates an action against each Azure cloud role instance. The rest of the expression checks whether the remainder of dividing that index by 24 equals the current hour. This is a simple and elegant way to ensure every instance is restarted once per day.
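To see how the expression spreads reboots across the day, here is a minimal Python sketch that mirrors it. It is not CloudMonix code: the function names are hypothetical, and it assumes InstanceIndex is a plain integer that starts at 0, which the expression itself does not state.

```python
def should_reboot(check_hour_utc, instance_index):
    """Python rendering of: CheckTimeUtc.Hour == (InstanceIndex % 24)."""
    return check_hour_utc == instance_index % 24


def reboot_hours(instance_index):
    """Hours of the day (0-23, UTC) at which the given instance matches."""
    return [hour for hour in range(24) if should_reboot(hour, instance_index)]


def instances_rebooting_at(hour, instance_count):
    """Indexes (assumed 0-based) of the instances that match at the given hour."""
    return [i for i in range(instance_count) if should_reboot(hour, i)]


if __name__ == "__main__":
    for index in range(4):
        print(f"instance {index} matches at hour {reboot_hours(index)[0]} UTC")
    # With more than 24 instances, some instances share an hour.
    print(instances_rebooting_at(0, 30))
```

Under these assumptions every index matches exactly one hour per day. A Role with 24 or fewer instances never has two instances matching in the same hour, while a larger Role will have instances that share an hour (for example, indexes 0 and 24).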

This Action is built into the default CloudMonix profiles under the name “Daily Reboot”. However, it is disabled by default. Trial and Ultimate users of CloudMonix can simply tick the Enabled checkbox to activate it.


Reactive stability – REBOOTS ON DEMAND

Daily reboots are a great proactive measure for the stability of Azure cloud role instances.  But what happens when your application encounters critical issues throughout the day?  Severe memory leaks, queued up or “stuck” IIS requests, hung processes, etc. can all lead to major instability of the application at random times during the day.

Reboots can also be used to recover from such events. Many CloudMonix users set up actions to reboot their servers when free memory is getting low or IIS requests start piling up. By default, CloudMonix tracks available memory as a metric called “MemoryFree” and queued-up requests as a metric named “AspNetRequestsQueued”. Setting up an action that reboots an instance when available memory stays below some threshold for a sustained amount of time is trivial and takes a few seconds. To reduce the chance of all instances being rebooted at the same time, it is also possible to add further constraints to the expression.
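The logic of such a rule can be sketched in Python as follows. This is an illustration only, not CloudMonix expression syntax: the 200 MB threshold, the three-sample window, and the minute-slot staggering scheme are all hypothetical choices, and the function names are made up for this example.

```python
def sustained_below(samples, threshold, required_count):
    """True when the last `required_count` samples are all below `threshold`."""
    if len(samples) < required_count:
        return False
    return all(value < threshold for value in samples[-required_count:])


def should_reboot_for_low_memory(memory_free_mb, instance_index, check_minute,
                                 threshold_mb=200, required_count=3, slots=4):
    """Combine a sustained low-memory check with a staggering constraint.

    memory_free_mb: recent MemoryFree readings for one instance, oldest first.
    The slot rule is a hypothetical extra constraint: an instance may only
    reboot in a minute whose slot matches its index, so instances in
    different slots cannot be rebooted by the same evaluation.
    """
    low = sustained_below(memory_free_mb, threshold_mb, required_count)
    in_slot = check_minute % slots == instance_index % slots
    return low and in_slot


if __name__ == "__main__":
    readings = [500, 150, 120, 100]
    print(should_reboot_for_low_memory(readings, instance_index=1, check_minute=5))
    print(should_reboot_for_low_memory(readings, instance_index=2, check_minute=5))
```

Requiring several consecutive low readings avoids rebooting on a momentary dip, and the slot constraint is one simple way to keep instances with different indexes from rebooting together.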

The Action that reboots instances one by one when RAM gets low is built into the default CloudMonix profiles under the name “Low Ram Reboot”. However, it is disabled by default. Trial and Ultimate users of CloudMonix can simply tick the Enabled checkbox to activate it.


Rebooting Without Outages

Rebooting Azure Role Instances may sound a little scary.  What happens during reboot?  Are users impacted?  Is work lost?  How can one ensure that things are stable during a reboot?

Azure makes it relatively simple to handle instance reboots cleanly. Keep in mind that Azure will likely reboot all of your instances a few times per month anyway as part of its scheduled updates, so handling reboots is necessary regardless of the CloudMonix actions outlined above. Also keep in mind that Microsoft strongly recommends that every Cloud Role has at least 2 instances running, so that one instance can be upgraded, rebooted, migrated, etc. while the other(s) handle the live load.

When dealing with Azure Web Roles, reboots can sometimes be handled by the platform out of the box. When a cloud role instance is rebooted, it is first taken out of the load balancer but not rebooted right away. This stops it from receiving new requests while still allowing it to finish the web requests it is currently processing. This makes rebooting Web Role instances relatively painless, assuming your web requests complete quickly.

Azure Worker Roles (that execute jobs) and Azure Web Roles (with slower response times) may need to tell Azure to delay the reboot until all work is complete. This is done by overriding the OnStop method in the WorkerRole class and ensuring that work is completed before allowing the method to exit. Keep in mind that Azure will wait up to 5 minutes before it forces a reboot, so any in-flight work must be wrapped up quickly.
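The pattern behind such an OnStop override is: signal the work loop to stop taking new jobs, then block until the loop has exited or a deadline passes. The Python sketch below illustrates only that pattern. It is not Azure SDK code; in a real role the equivalent logic would be written in C# inside OnStop, and the `Worker` class, its method names, and the timeout value here are hypothetical.

```python
import threading
import time


class Worker:
    """Processes jobs one at a time and can be asked to stop between jobs."""

    def __init__(self, jobs):
        self._jobs = list(jobs)
        self._stop_requested = threading.Event()
        self._finished = threading.Event()
        self.completed = []

    def run(self):
        """Main loop: check for a stop request before starting each job."""
        try:
            for job in self._jobs:
                if self._stop_requested.is_set():
                    break
                time.sleep(0.01)  # stand-in for real work
                self.completed.append(job)
        finally:
            self._finished.set()

    def on_stop(self, timeout_seconds):
        """Request a stop, then wait for the loop to exit or the timeout to pass.

        Returns True if the loop exited within the timeout.
        """
        self._stop_requested.set()
        return self._finished.wait(timeout_seconds)


if __name__ == "__main__":
    worker = Worker(range(1000))
    thread = threading.Thread(target=worker.run)
    thread.start()
    time.sleep(0.05)
    drained = worker.on_stop(timeout_seconds=5)
    thread.join()
    print("drained cleanly:", drained, "jobs completed:", len(worker.completed))
```

Checking the stop flag between jobs means a job that has already started is allowed to finish, while no new job begins. The timeout should sit comfortably below the 5-minute limit mentioned above.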

A really great article on this subject by Rick Anderson at Microsoft is available here.  It outlines a few lines of code that can be implemented to properly handle the OnStop event in the WorkerRole class.


CloudMonix

CloudMonix is a cloud monitoring SaaS product designed specifically to handle the monitoring and automation needs of Microsoft Azure users. It supports monitoring and auto-healing a wide range of Azure resources, such as Virtual Machines, Cloud Roles, Web Apps, Service Bus, Storage, SQL Azure, and more. Learn more here.