Many web-hosting services start out small: everything runs on one machine, with little advance thought given to what happens when that machine's capacity is exceeded, or when it needs to be taken down for an upgrade.

We would also like to start out small.  We can't invest in a big SAN or server farm for a service with an unproven user base.  But we would like to know in advance how our architecture will scale with the user base, and we would like our service to remain available even when an individual component (or at least an individual virtual component) is down.

Using virtualization, we will create, on a single host, all of the separate operating system images we would need for a highly scalable system.  Although the system will initially depend on that single host being up, we will be able to take down any individual guest image without interrupting service.  Moreover, when it does come time to scale the service up, we will already be using most of the technologies necessary to do so.

Here are the components which will initially be running on the single host:

  • Front end: One or more guest images running the LVS load balancer.  If we have multiple LVS front ends, we could either use round-robin DNS between them, or have the backup monitor the primary and steal its IP address (see the sketch after this list).
  • Web servers: Two or more guest images running Apache.  These are also the images users will log into to maintain their web content, which makes any local root exploit catastrophic; we therefore need at least two of them so that we can perform rapid kernel upgrades without interrupting service.
  • Home directory store: In order to keep the web server images stateless, we need to pick a technology allowing homedirs to appear the same on multiple servers.  The best option at this moment appears to be GFS from Red Hat.  It relies on a back-end block storage device accessible from each server instance; this can initially be a local disk partition, and can eventually be replaced by a RAID array or SAN.
  • Database servers: One or more guest images running MySQL.  If we have multiple database images serving the same databases (which may or may not be technically feasible), we will need a backing store for those databases similar to the home directory store.
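To make the front-end failover idea concrete, here is a rough sketch (in Python, with made-up addresses, interface name, and thresholds) of what the backup LVS image would do: watch the primary, and if it stops answering, bring the service IP up locally and announce it.  In practice an existing tool such as keepalived or heartbeat would handle this, but the logic is roughly:

    # Sketch only: the addresses, interface, and timing below are placeholders.
    import subprocess
    import time

    PRIMARY_HOST = "10.0.0.1"        # primary LVS guest (hypothetical)
    SERVICE_IP   = "10.0.0.100/24"   # virtual IP clients connect to (hypothetical)
    INTERFACE    = "eth0"
    CHECK_PERIOD = 5                 # seconds between liveness checks
    MISS_LIMIT   = 4                 # consecutive misses before takeover

    def primary_alive():
        # One ICMP echo with a one-second timeout; True if the primary answered.
        return subprocess.call(["ping", "-c", "1", "-W", "1", PRIMARY_HOST],
                               stdout=subprocess.DEVNULL,
                               stderr=subprocess.DEVNULL) == 0

    def take_over_ip():
        # Bring the service IP up on this image...
        subprocess.call(["ip", "addr", "add", SERVICE_IP, "dev", INTERFACE])
        # ...and send gratuitous ARP so switches and clients learn its new home.
        subprocess.call(["arping", "-U", "-I", INTERFACE, "-c", "3",
                         SERVICE_IP.split("/")[0]])

    misses = 0
    while misses < MISS_LIMIT:
        misses = 0 if primary_alive() else misses + 1
        time.sleep(CHECK_PERIOD)
    take_over_ip()

Stealing the IP (rather than relying on round-robin DNS) has the advantage that we don't have to wait for DNS caches to expire before traffic reaches the backup.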

Other options for the home directory store include a distributed filesystem (NFS, AFS, or perhaps even CIFS), or some kind of semi-automated rsync setup where users make changes on one image and propagate them to the other images.  GFS appears to provide the closest approximation to the local filesystem semantics expected by web applications.
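If we did go the rsync route, the propagation step could be as simple as the sketch below: push a user's home directory from the image where they made their changes to the other web server images.  The host names and paths are placeholders, and this is only a fallback if GFS proves unworkable.

    # Sketch only: hypothetical host names and paths.
    import subprocess
    import sys

    WEB_IMAGES = ["web1.example.com", "web2.example.com"]

    def propagate(user, source_host):
        src = "/home/%s/" % user
        for host in WEB_IMAGES:
            if host == source_host:
                continue                     # don't push a copy back onto itself
            # -a preserves ownership/permissions/times, -z compresses over the
            # wire, --delete removes files the user deleted on the source image.
            subprocess.check_call(["rsync", "-az", "--delete", src,
                                   "%s:/home/%s/" % (host, user)])

    if __name__ == "__main__":
        propagate(sys.argv[1], sys.argv[2])

The catch, and the reason GFS still looks better, is that nothing here is atomic: a web application that writes files on one image will not see them on the others until the push runs.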


More learnings about GFS.  First, it's a lot more complicated than I anticipated.  Here's the rabbit hole:

1. You can't just have several computers access the same block storage and cooperate implicitly.  They have to talk to each other.  In GFS, the first node to access a file becomes the lock manager for the file, and other nodes accessing the same file must synchronize with it.

2. If a node fails, it will fall out of communication with the others.  In the ordinary course of events, this would cause any files it has grabbed to become inaccessible to other nodes.

3. To address this circumstance, each node expects to receive heartbeats from the other nodes every five seconds (by default).  If four consecutive heartbeats are missed, the node is considered out of communication, and the cluster can begin the process of reclaiming its locks.

4. First, though, the offending node must be verified as dead.  If it is not really dead (but has merely lost network connectivity or was very busy for a while), it could continue accessing the block storage device from outside the cluster and cause corruption.  The process of making sure the node really is dead (or at least cannot touch the shared storage) is called "fencing".  To ensure consistency, no node can acquire a new lock in GFS while a fencing operation is in progress, which causes the filesystem to partially lock up for a while (a toy model of this timing appears after this list).

5. The preferred kind of fencing is called power fencing, and involves commanding a network power switch to cut power to the node temporarily.  This involves buying a network power switch and figuring out how to get Linux to talk to it, which is Work, but it's the right solution in the long run.  There is another reasonable alternative called fabric fencing, where the offending node is prevented from accessing the block storage device.  This method is also kind of hairy because there's no standard interface for informing a SAN to stop paying attention to a particular computer for a while, and it certainly won't work with a locally mounted disk partition.

6. The dirty, totally unsupported kind of fencing is called manual fencing.  Under this method, the system administrator is notified that a fencing operation is needed, and the GFS filesystem remains partially locked up until the administrator acknowledges that the node is really dead.  If the sysadmin acknowledges a manual fence when the node is not actually dead, filesystem corruption can result.  In spite of these dangers, manual fencing is probably adequate for our initial setup, because in that setup we are only concerned with high availability across planned outages (kernel upgrades), where one of the nodes is gracefully withdrawn from the cluster.  An unplanned failure of a single node (as opposed to the whole host) should be exceedingly uncommon.

7. GFS per se does not take responsibility for all of the above communication.  It is integrated with something called Red Hat Cluster Suite (in marketing speak) or the Linux Cluster Project (in geek speak), which is a collection of tools accomplishing the stuff mentioned above.  These tools can also be used without shared storage, simply to do monitoring and failover, but are usually used in conjunction with GFS.

8. The identities of the other tools in RHCS have changed over time, and GFS itself is being superseded by a successor called GFS2.  The Linux Cluster Project documentation appears to have kept pace with these changes, but it is a shifting landscape.
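Before moving on: to make the timing in points 3 and 4 concrete, here is a toy model of how membership and the lock freeze interact.  This is not actual Cluster Suite code; the five-second interval and four-miss limit are just the defaults quoted above.

    # Toy model only -- illustrates the arithmetic, not the real implementation.
    import time

    HEARTBEAT_INTERVAL = 5                        # seconds (default quoted above)
    MISS_LIMIT = 4                                # misses before a node is suspect
    DEADLINE = HEARTBEAT_INTERVAL * MISS_LIMIT    # 20 seconds of silence

    class ClusterModel:
        def __init__(self, nodes):
            now = time.time()
            self.last_heard = {node: now for node in nodes}
            self.pending_fences = set()           # nodes awaiting fencing

        def heartbeat(self, node):
            self.last_heard[node] = time.time()

        def check_members(self):
            # Any node silent for DEADLINE seconds is scheduled for fencing.
            now = time.time()
            for node, seen in self.last_heard.items():
                if now - seen >= DEADLINE:
                    self.pending_fences.add(node)

        def try_lock(self, node, resource):
            # Point 4: while any fence is outstanding, no new locks are granted,
            # so the filesystem appears partially frozen to every node.
            if self.pending_fences:
                return False
            return True   # the real cluster would forward this to the lock master

        def fence_complete(self, node):
            # Called once the node is known to be dead: the power switch cut its
            # power, the fabric cut its storage path, or a sysadmin acknowledged
            # a manual fence.  Its locks can now be reclaimed and granting resumes.
            self.pending_fences.discard(node)
            self.last_heard.pop(node, None)

The thing to notice is the 20-second floor: even with perfect fencing, a crashed node freezes new lock activity for at least four heartbeat intervals, and with manual fencing the freeze lasts until a human responds.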

Second, GFS is used almost exclusively in conjunction with SANs.  This makes sense, because virtualization is a fairly recent thing, and SANs are pretty much the only other way to let several computers access the same block storage device.  It may be difficult to avoid the complexities associated with SANs when setting up GFS, even though we plan to use a local disk partition shared among the several virtual images.  We may even need to set up an extra guest image to act as a poor man's SAN for the other guest images.
