In this article, we’ll discuss a few simple approaches to keeping your Azure Cloud Roles stable, both proactively and reactively, and how CloudMonix helps with them. Since unstable instances often need to be rebooted, we’ll also discuss methods to ensure that reboots do not cause issues or outages.
- Proactive stability – rebooting instances on a daily basis
- Reactive stability – rebooting misbehaving instances when issues occur
- Handling server restarts – ensuring that reboots do not cause outages
Proactive stability – DAILY REBOOTS
Probably the simplest and most effective method to keep your applications stable is to reboot your cloud role instances on a regular basis. No matter how few memory leaks your application has, the longer your server runs without a reboot, the more issues and performance degradations creep in. Memory and disk fragmentation, poorly closed connections, obsolete data in cache, large temporary folders, and of course memory leaks can all cause your application’s performance to degrade slowly over time. Unfortunately, very few organizations reboot their Cloud Role instances on a regular basis, largely because it is not trivial to do in an automated fashion.
In CloudMonix, we’ve devised a clever way to reboot instances one at a time, without impacting the overall stability of the Role itself. The basic idea is that each clock hour, only the instances whose instance number, taken modulo 24, matches the current UTC hour are rebooted, so every instance is rebooted once a day.
The CloudMonix expression that drives this Action is a single line of C#-like code:
CheckTimeUtc.Hour == (int.Parse(InstanceName.Substring(InstanceName.LastIndexOf("_") + 1)) % 24)
CheckTimeUtc is a special variable that CloudMonix tracks; it represents the current time in UTC. InstanceName is another special variable, set to the name of the Azure cloud role instance the action is currently being evaluated against. The rest of the expression extracts the number that follows the last underscore in the instance name and checks whether the remainder of dividing that number by 24 equals the current hour. This is simple and elegant.
This Action is built into the default CloudMonix profiles under the name “Daily Reboot” and is disabled by default. Trial or Ultimate users of CloudMonix can simply check the Enabled checkbox to enable it.
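To see how the expression spreads reboots across the day, here is a minimal Python sketch of the same arithmetic. It is only an illustration, not CloudMonix code: the function names and the sample instance names are assumptions made for this example.

```python
def reboot_hour(instance_name: str) -> int:
    """Return the UTC hour at which the Daily Reboot expression is true."""
    # Same idea as InstanceName.Substring(InstanceName.LastIndexOf("_") + 1):
    # take everything after the last underscore.
    suffix = instance_name[instance_name.rfind("_") + 1:]
    return int(suffix) % 24


def should_reboot(instance_name: str, check_hour_utc: int) -> bool:
    """Mirror of: CheckTimeUtc.Hour == (instance number % 24)."""
    return check_hour_utc == reboot_hour(instance_name)


if __name__ == "__main__":
    # Sample names are hypothetical; only the trailing number matters.
    for name in ["WebRole_IN_0", "WebRole_IN_1", "WebRole_IN_23", "WebRole_IN_24"]:
        print(name, "-> reboots during UTC hour", reboot_hour(name))
```

Note that instances 0 and 24 land on the same hour, so a Role with more than 24 instances will have more than one instance rebooting in some hours.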
Reactive stability – REBOOTS ON DEMAND
Daily reboots are a great proactive measure for stability of Azure cloud role instances. But what happens when your application encounters critical issues throughout the day? Severe memory leaks, queued up or “stuck” IIS requests, hung processes, etc. can all lead to major instability of the application at random times during the day.
Such events can also be recovered from with a reboot. Many CloudMonix users set up actions to reboot their servers when free memory is getting low or IIS requests start piling up. By default, CloudMonix tracks available memory as a metric called MemoryFree and queued requests as a metric named AspNetRequestsQueued. Setting up an action that reboots an instance when available memory stays below some threshold for a sustained amount of time takes only a few seconds. To reduce the chance of all instances being rebooted at the same time, it is also possible to add further constraints to the expression.
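The logic behind such an action can be sketched in Python. This is a hedged illustration of the idea, not CloudMonix expression syntax: the class, the megabyte unit, the window size, and the staggering rule in may_reboot are all assumptions invented for this example.

```python
from collections import deque


class SustainedLowMemory:
    """True only when each of the last `window` samples is below the threshold."""

    def __init__(self, threshold_mb: float, window: int) -> None:
        self.threshold_mb = threshold_mb
        self.samples = deque(maxlen=window)  # keeps only the newest samples

    def observe(self, memory_free_mb: float) -> bool:
        self.samples.append(memory_free_mb)
        window_full = len(self.samples) == self.samples.maxlen
        return window_full and all(s < self.threshold_mb for s in self.samples)


def may_reboot(instance_number: int, minute_of_hour: int, slots: int = 6) -> bool:
    """Hypothetical extra constraint: each instance may reboot only in its own
    10-minute slot, so two neighbouring instances never reboot together."""
    return (minute_of_hour // 10) % slots == instance_number % slots


if __name__ == "__main__":
    detector = SustainedLowMemory(threshold_mb=500, window=3)
    for sample in [400, 450, 300, 600, 100]:
        print(sample, "->", detector.observe(sample))
```

A single low reading does not trigger anything here; the detector fires only after the whole window has stayed below the threshold, which is what "sustained" means in the paragraph above.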
Rebooting Without Outages
Rebooting Azure Role Instances may sound a little scary. What happens during reboot? Are users impacted? Is work lost? How can one ensure that things are stable during a reboot?
Azure makes it relatively simple to handle instance reboots cleanly. Keep in mind that Azure will likely reboot all of your instances a few times per month anyway as part of its scheduled updates, so handling reboots is necessary regardless of the CloudMonix actions outlined above. Also keep in mind that Microsoft highly recommends that every Cloud Role have at least 2 instances running, so that one instance can be upgraded, rebooted, or migrated while the others handle live load.
When dealing with Azure Web Roles, reboots can sometimes be handled by the platform out of the box. When a cloud role instance is rebooted, it is first taken out of the load balancer but is not rebooted right away. It stops receiving new inbound requests, yet it is still able to finish the web requests it is currently processing. This makes rebooting Web Role instances relatively painless, provided your web requests complete quickly.
Azure Worker Roles (that execute jobs) and Azure Web Roles (with slower response times) may need to tell Azure to hold off on the reboot until all work is complete. This is done by overriding the OnStop method in the WorkerRole class and ensuring that work is complete before allowing the method to exit. Keep in mind that Azure will wait up to 5 minutes before it forces a reboot, so clean up your work fast.
Rick Anderson at Microsoft has written a really great article on the subject. It outlines a few lines of code that can be implemented to properly handle the OnStop event in WorkerRole.cs.
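The general shape of such a handler can be sketched independently of the Azure SDK. The Python below is only an illustration of the pattern (signal the work loop to stop, then block until the in-flight job finishes or a deadline passes). It is not the code from the article mentioned above and does not use any Azure API; the Worker class, its method names, and the timings are assumptions for this example.

```python
import threading
import time


class Worker:
    def __init__(self) -> None:
        self._stop_requested = threading.Event()
        self._finished = threading.Event()
        self.jobs_done = 0

    def run(self) -> None:
        """Main loop: start a new job only while no stop has been requested."""
        while not self._stop_requested.is_set():
            time.sleep(0.01)  # stands in for one unit of real work
            self.jobs_done += 1
        self._finished.set()  # the in-flight job has completed

    def on_stop(self, timeout_seconds: float = 300.0) -> bool:
        """Request a stop, then wait for the loop to finish or the deadline.

        The 300-second default echoes the 5-minute limit described above.
        Returns True if the loop finished in time.
        """
        self._stop_requested.set()
        return self._finished.wait(timeout=timeout_seconds)


if __name__ == "__main__":
    worker = Worker()
    thread = threading.Thread(target=worker.run)
    thread.start()
    time.sleep(0.05)
    print("stopped cleanly:", worker.on_stop(timeout_seconds=5.0))
    thread.join()
```

The key design point is that the stop handler does not return until the current job is done, which is what keeps the platform from rebooting the instance in the middle of a unit of work.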
CloudMonix is a cloud monitoring SaaS product designed specifically to handle the monitoring and automation needs of Microsoft Azure users. It supports monitoring and healing a wide range of Azure resources, such as Virtual Machines, Cloud Roles, Web Apps, Service Bus, Storage, SQL Azure, and more.