Leverage a clustering solution to achieve high availability with SQL Server
Although high availability is typically an IT concern, there are many ways in which the deployment and maintenance of high availability solutions can impact development efforts. Furthermore, in cases in which applications or systems need to be architected to ensure high availability, developers need to have a solid understanding of the pros and cons of the various offerings and capabilities at their disposal—along with an understanding of how those pros and cons can impact their applications. This first article in a multi-part series will provide an overview of common high availability solutions provided out of the box by SQL Server.
Core Concepts of High Availability
At the most basic level high availability is about increasing the availability of an application or system—just as the term indicates. High availability is commonly used as a means to migrate or prevent disasters. Another facet of high availability is the need to make data and systems available to end users even when they’re offline. A perfect example is a traveling salesperson armed with sales leads on their laptop. Without connectivity, the data on this laptop would still be available in a disconnected fashion. But once connectivity is restored, any offline changes made by the user could be marshaled back to the home office, where any changes made there or by other traveling salespeople could be pushed back down to the local copy of data. In this sense, availability portends to more than just attempting to protect against disasters by means of ensuring additional redundancy.
In a similar measure, redundant data and systems doesn’t fully protect against disaster. Protecting against the potential problem that a single database, server, or data center might be lost is a key part of disaster recovery preparedness that's facilitated by high-availability solutions that create redundant data and systems. However, it’s important to realize that not all disasters are caused by the simple failure of a component or system. Instead, some disasters are caused by human error or other problems that corrupt data. That’s why I regularly council my SQL Server consulting clients to avoid confusing high availability with disaster preparedness.
Likewise, when it comes to implementing an high-availability solution, developers need to be aware of the kinds of downtime and data loss that are potentially associated with high-availability solutions. Although recovery time objectives (RTO) and recovery point objectives (RPO) are commonly considered to be IT concerns, developers who have a solid understanding of these concerns are able to craft better high-availability solutions or adapt to scenarios in which their solutions have been made redundant highly available. The following is a very high-level overview of clustering as a solution for achieving high availability with SQL Server.
Clustering has long been a hallmark of redundancy and is one of the first high availability solutions that comes to mind for many developers and IT professionals. Behind the scenes, multiple servers coordinate with a domain controller (DC) to share an IP address and virtual machine name, which means that if the hardware that currently hosts those resources goes down, other nodes in the cluster can assume control of those resources and continue operations with minor loss of service.
All in all, clustering is a complex technology, but it offers fantastic benefits, including the following:
- Transparent to developers. With clustering, developers point their applications at a virtual server that accepts inbound requests and queries just like a single physical server would. Yet, this virtual server and all associated resources and databases actually end up being dynamically hosted on fully redundant hardware. Consequently, failover is effectively transparent to developers that require no extra development effort to implement a clustering solution.
- Automatic failover and minimal downtime. With clustering, failover is automatic and typically occurs within a few seconds of the active host crashing or going offline. This is because specialized NICs let cluster member nodes (called quorum members) constantly keep tabs on each other for redundancy purposes. Even though failover is automatic, it’s important to realize that if failover takes 20 seconds to initiate, and another 10 seconds to execute, that’s still a total of 30 seconds in which queries can’t complete as expected. Likewise, as soon as a new quorum member takes control of virtual resources, the SQL Server service on this new host still needs to spin and take control of the underlying databases, meaning that it has to run through the RECOVERY process, which typically takes about one minute. (Although DBAs can tune this RECOVERY time back a bit on a server-wide level, SQL Server 2012 gives DBAs the option to provide more granular RECOVERY intervals through the use of indirect checkpoints.)
Clustering does have some drawbacks, though, including the following:
- Complexity. Although setting up SQL Server clusters is much easier than it used to be (and SQL Server 2012 will make it much easier to set up clusters in certain cases through AlwaysOn failover cluster instances), properly deploying, configuring, and managing clusters is a non-trivial operation for most organizations.
- A single point of failure. From an HA standpoint, most clustering implementations also suffer because they require access to a single, shared set of disk resources (commonly hosted in the SAN, but occasionally hosted as shared disks in a dedicated enclosure that are accessible by one or more hosts) that can become a single point of failure. To this end, clusters are a great option if you’re trying to address the potential for regular downtime in the form of Windows security patches of if you’re worried about the loss of a single host of system. They're not a perfect solution because disk failures can still occur and they’re not commonly used for creating multisite redundancy. However, in recent years, the notion of multisite redundancy has been heavily addressed by various stretch cluster implementations that can actually be used to stretch availability across multiple data centers or sites.
- Price. Clustering not only requires redundant hardware (which, all forms of HA require), but is also available only on Datacenter and Enterprise editions of Windows. SQL Server Enterprise Edition and Datacenter Edition are also commonly used in clusters. However, there's minimal support that's licensed for the SQL Server Standard Edition. Shared accessibility to disks can also be potentially expensive, although this is less a concern compared to previous years as more and more systems are connected to SANs. (It’s also worth noting that only active nodes need to be licensed for SQL Server, instead of both active and passive nodes.)
In the next article in this series, I take a look at log shipping and database mirroring in "SQL Server High Availability for Developers, Part 2". In part three of this article series, we’ll take a look at the new (and awesome) replacement for database mirroring in SQL Server 2012—AlwaysOn Availability Groups. We’ll also take a look at replication, a form of high availability that all developers should be familiar with simply because it lets multiple locations actively edit or modify data at the same time (unlike solutions covered in this article that merely mirror changes from a single location to other redundant locations).