In past roles, I have been responsible for a number of hosted applications and services. Back then we used the term Application Service Provider (ASP) to describe companies supplying what we now call Software as a Service (SaaS) solutions. I learned many lessons during this period in my career, not the least of which was how to give up control while remaining confident in the quality of our hosted offering. In fact, this is a very important lesson for any company hosting applications in the cloud today: You will give up some control. You must be confident in the quality of service.

Your SLA: A Confidence Booster

The question is: How does one achieve confidence in quality of service? There isn’t an easy answer, but I can tell you that the one thing that gave me, my team, my company, and our customers the most confidence was the Service Level Agreement (SLA). I’m not just talking about a document with hopeful promises regarding quality of service—I’m talking about a document that is backed by solid reports and produced from meticulous network and system review. I’m talking about a document that can help you deploy a more reliable application rigged with appropriate diagnostic output. I’m talking about a document that internally drives the kind of strict process and procedure that is warranted for any mission-critical application. It is indeed your best friend for any hosted application—cloud or no cloud.

Any executive responsible for the viability of a hosted application has to rely on many team members to deliver; however, the exercise of writing a comprehensive SLA can give you that single point of insight into important details related to design, architecture, development, test, infrastructure, equipment, deployment, backup and recovery, monitoring, and more.

Essential Elements of an SLA

Key elements of an SLA include typical business details, such as the period of the agreement and the termination clause; requirements of the customer (yes, customers have obligations too, such as advance notification of forthcoming peak loads); and sometimes provisions for consequential loss (damages to the customer for non-performance) with the appropriate force majeure protection for the company. The meat of the SLA is in the description of services covered by the agreement (and those not covered), along with a description of system architecture, network and application security, network management and reliability, data integrity and confidentiality protection, system availability, performance and capacity planning, backup and recovery procedures, escalation procedures, and monitoring efforts. Whew! That is a lot of information! And I’ll bet you have read at least one SLA that didn’t come close to this level of detail, likely a one-page document with vague promises.

Why Don’t We See More Comprehensive SLAs?

There are a lot of reasons why companies do not produce comprehensive SLAs. Sometimes it is because the company doesn’t want to share details that could be used maliciously if they were to fall into the wrong hands. Sometimes the company’s customers are satisfied without a detailed SLA and never demand one. Sometimes the company doesn’t have the resources to produce a comprehensive SLA.

But even if you don’t share your SLA outside the bounds of your company, you should seriously consider writing one as your internal guide to the reliability of your hosted application. The process of writing the SLA will make your application better. You will be better prepared for worst-case scenarios. I guarantee you will find issues in the process and save yourself from serious problems down the road.

SLAs in Action

Let me give you two concrete examples from experience. I was not responsible for the database backup and recovery process, but to support the SLA the team had to help me understand how database backup and recovery worked. We also had a commitment to test this process monthly to ensure recovery was possible from the logs. Wouldn’t you know, the first time we ran the recovery process it failed because of a problem with our process. Had the SLA commitment not required us to test this process, we might never have known we couldn’t recover from a database backup.

In another scenario, we had commitments to produce PDF documents from requests to our system and send them via email to recipients. Per the SLA, these documents were to be delivered within 15 minutes. During peak periods, we could receive tens of thousands of requests to process in batch, so we had to measure how long it took to write a single PDF and measure the capacity of a single machine to know our scaling capabilities. There's nothing like finding out that you can only generate a fraction of the PDF documents per hour that you thought you could! Fortunately, we were able to speed up the process with a better implementation and also enlarge our farm of document-generating machines so we could scale elastically.
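To make that capacity arithmetic concrete, here is a minimal sketch in Python. It is not what we actually ran; the function name, the batch size, the two-second render time, and the four workers per machine are all hypothetical figures chosen for illustration. It also assumes the whole 15-minute window is available for PDF generation, which ignores queueing and email delivery time.

```python
import math


def machines_needed(requests: int,
                    seconds_per_pdf: float,
                    window_seconds: float,
                    workers_per_machine: int = 1) -> int:
    """Estimate how many machines a batch needs to finish inside the SLA window.

    All inputs are measurements or assumptions you supply yourself.
    """
    if seconds_per_pdf <= 0 or window_seconds <= 0 or workers_per_machine < 1:
        raise ValueError("timings must be positive and workers at least 1")

    # Whole PDFs one worker can finish before the window closes.
    pdfs_per_worker = math.floor(window_seconds / seconds_per_pdf)
    pdfs_per_machine = pdfs_per_worker * workers_per_machine
    if pdfs_per_machine == 0:
        raise ValueError("a single PDF takes longer than the SLA window")

    # Round up: a partially used machine is still a machine.
    return math.ceil(requests / pdfs_per_machine)


# Hypothetical peak: 30,000 requests, 2 seconds per PDF,
# a 15-minute (900-second) window, 4 parallel workers per machine.
print(machines_needed(30_000, 2.0, 15 * 60, workers_per_machine=4))  # 17
```

With these made-up numbers, each machine finishes 1,800 documents in the window, so the batch needs 17 machines. Halving the time per PDF halves the farm, which is why a better implementation mattered as much as adding hardware.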

Consider Costs, Staffing, and Trust

I started out saying the SLA was your best friend in the cloud, and so far I have really focused on hosted applications in general. What is the difference? I'll break it down into costs, staffing, and trust. Regarding costs, my experiences pre-cloud involved a large capital investment to inspire confidence that we could handle peak loads. (Trust me, all customers are 100 percent certain they will send you way more traffic than they ever really do, but you have to stand ready.) With the help of cloud providers, you have a much smaller capital investment and can scale up to meet peak loads and exponential growth. Costs and turnaround time for equipment replacement can also be mitigated by the cloud provider.

Regarding staffing, we trained a rather large staff to manage every aspect of production operations, and though we still care about every aspect, we can use fewer staff so long as we are properly monitoring those aspects the cloud provider is responsible for.

Regarding trust, that is where the SLA comes in. Pre-cloud, we spent a lot of time proving to our customers that we could be relied upon for the security and reliability of the system. This does not go away when we leverage a cloud provider; in fact, it may get worse if the provider’s SLA isn’t adequate.

A Living Document

I am confident that if you approach the SLA the way I have in the past—as a living document that proves the viability of every aspect of your application and forces the team to think through every feature, from design and development to deployment and monitoring—you can come up with a solution for any concern your customers may have. You might have to rig your application differently to support monitoring and reporting; you will have to mix statistics from your cloud provider with statistics you have the ability to gather; and you might even have to blindly trust some of your cloud provider’s services that are not well described in your cloud provider’s own SLA. Still, the effort of writing the comprehensive SLA will be an excellent internal exercise that gives the entire company a sense of security and control over your hosted applications and services, and you’ll know where the risks are and how to respond to them.
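One simple way to mix a provider's numbers with your own is to multiply availabilities. The sketch below is a rough illustration rather than a method taken from any particular SLA: it assumes your application cannot run when the provider is down (a serial dependency) and that the two fail independently, and every figure in it is hypothetical.

```python
def composite_availability(*components: float) -> float:
    """Availability of a chain in which every component must be up.

    Assumes independent failures; each value is a fraction such as 0.999.
    """
    result = 1.0
    for availability in components:
        if not 0.0 <= availability <= 1.0:
            raise ValueError("availability must be between 0 and 1")
        result *= availability
    return result


def allowed_downtime_minutes(availability: float, period_days: int = 30) -> float:
    """Minutes of downtime a given availability permits over the period."""
    return (1.0 - availability) * period_days * 24 * 60


# Hypothetical figures: the provider promises 99.95 percent,
# and your own monitoring shows the application layer at 99.9 percent.
combined = composite_availability(0.9995, 0.999)
print(f"combined availability: {combined:.5f}")
print(f"downtime budget per 30 days: {allowed_downtime_minutes(combined):.1f} minutes")
```

The combined figure is always lower than either input, so the availability you promise your customers should never exceed what this kind of calculation supports.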