The purpose of this guide is to answer one of the questions I receive most often:
How do I host my web applications in the cloud in a way that is redundant but also inexpensive?
Before you begin reading the guide, try to keep the following things in mind:
- Try to understand what an application is doing before blindly configuring it as the guide states. This helps in two ways: it allows you to begin thinking about ways you can improve your configuration for your specific needs and it will give you tools to fix things when they break later.
- Stay lean. Some portions of the guide may not apply to your application's needs. Instead of wasting your time on daemons that you don't need, skip any parts that don't apply to your specific application. On the other hand, if your application needs more functionality than this guide provides, add it carefully. See the previous bullet point to understand what I'm talking about.
- This is not the only way to configure a redundant cloud environment. This guide covers the configuration that I like best. If you don't like a particular daemon or Linux distribution mentioned in the guide, use what you're most comfortable with or what you prefer.
- Cloud is what you make of it. Don't be afraid to forge your own path.
- Give me feedback. If you spot something that's incorrect, or if you find a more efficient way to handle a particular problem, let me know! I'll be glad to consider it for the guide and you'll receive proper attribution.
With that out of the way, let's begin the guide.
The high-level overview
To get an idea of the end result, review the diagram shown below:

My redundant cloud hosting configuration includes two load balancers, two web nodes, and two database/caching nodes. They all have interfaces on public and private networks.
There are three main service groups that I need to host my applications:
- Load balancing layer: Two needs are fulfilled at this layer - the distribution of load as well as redirection of traffic away from problematic web nodes.
  - heartbeat: adds automated redundancy by managing resources between multiple servers (http://linux-ha.org/wiki/Heartbeat)
  - ldirectord: allows for simple LVS configuration and also includes monitoring for misbehaving nodes (http://horms.net/projects/ldirectord/)
- Web service layer: As you might imagine, this layer is the workhorse of the entire configuration. This is where web content is served and where web content is stored in a clustered filesystem.
  - apache: tried and true open-source web server (http://httpd.apache.org/)
  - haproxy: high performance TCP/HTTP load balancer, used here to reach the database/caching layer (http://haproxy.1wt.eu/)
  - vsftpd: FTP daemon (http://vsftpd.beasts.org/)
  - glusterfs: simple clustered storage (http://www.gluster.org/)
- Database/caching layer: Without this layer, the configuration would grind to a halt. The applications running on the web service layer depend on this layer for rapid storage and retrieval of information.
  - MySQL: open-source database server (http://dev.mysql.com/)
  - memcached: memory object caching system (http://memcached.org/)
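To make the two jobs of the load balancing layer concrete, here is a minimal Python sketch of the idea: requests rotate across the web nodes, and any node that fails a health check is skipped. This is only a toy model under stated assumptions. It is not how ldirectord or LVS are implemented, and the class name, the node names, and the health check function are all hypothetical.

```python
from itertools import cycle
from typing import Callable, Iterable, Optional


class HealthAwareBalancer:
    """Toy model: round-robin over web nodes, skipping unhealthy ones."""

    def __init__(self, nodes: Iterable[str], is_healthy: Callable[[str], bool]):
        self._nodes = list(nodes)
        self._is_healthy = is_healthy          # hypothetical health check
        self._rotation = cycle(self._nodes)    # endless round-robin order

    def pick(self) -> Optional[str]:
        """Return the next healthy node, or None if every node is down."""
        # Look at each node at most once per call so we never loop forever.
        for _ in range(len(self._nodes)):
            node = next(self._rotation)
            if self._is_healthy(node):
                return node
        return None


if __name__ == "__main__":
    down = {"web2"}  # pretend web2 has failed its health check
    balancer = HealthAwareBalancer(["web1", "web2"], lambda n: n not in down)
    print([balancer.pick() for _ in range(4)])  # traffic avoids web2

    down.clear()     # web2 recovers and rejoins the rotation
    print([balancer.pick() for _ in range(4)])
```

In the real configuration the health check would be something like an HTTP request against each web node, and heartbeat would keep a second load balancer ready to take over the virtual IP if the first one fails.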
Platform requirements
In order to follow this guide, you'll need the following:
- Stable Linux distribution - pick whichever one you prefer, but I'll be using Fedora
- Six virtual machines - fewer than six will get a bit tricky and it reduces your redundancy
- Public and private network interfaces on each virtual machine - not required, but it's highly recommended
- One extra IP address - this will be your virtual IP address for load balancing (you will need more if you're hosting multiple sites with SSL, unless you want to use SNI)
- Ability to share an IP between multiple virtual machines - this will be a requirement for LVS-TUN (if you can't share IPs, you can try using LVS-NAT, but I wouldn't recommend it)
- Kernel modules - you'll need a few kernel modules, or the ability to compile and use them with your running kernel
- Linux kernel 2.6.27 or later - there are some great performance improvements for virtual machines and the fuse module in these kernels (not a strict requirement, but highly recommended)
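If you want a quick way to check the kernel recommendation on each virtual machine, something like the following Python sketch could work. It assumes the kernel release string starts with a dotted version number (as it does on typical Linux distributions); the function names and the `MINIMUM_KERNEL` constant are my own and are not part of any tool mentioned in this guide.

```python
import platform
import re

# Recommended minimum from the requirements list above.
MINIMUM_KERNEL = (2, 6, 27)


def parse_kernel_version(release: str) -> tuple:
    """Extract (major, minor, patch) from a string like '2.6.27-170.2.5.fc10'."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    major, minor, patch = match.groups()
    # A release such as '6.1' has no patch component; treat it as 0.
    return (int(major), int(minor), int(patch or 0))


def meets_minimum(release: str, minimum: tuple = MINIMUM_KERNEL) -> bool:
    """True if the release is at least the recommended kernel version."""
    return parse_kernel_version(release) >= minimum


if __name__ == "__main__":
    release = platform.release()  # kernel release of the running system
    try:
        verdict = "ok" if meets_minimum(release) else "older than recommended"
    except ValueError:
        verdict = "could not be parsed"
    print(f"kernel {release}: {verdict}")
```

This only covers the kernel version. Whether the modules you need are available for that kernel is something you'd still have to check on the machine itself.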
Step by step
I've broken the guide up into functional pieces to allow you to build your configuration and test it along the way. Click on the title of each step to see detailed instructions, diagrams and explanations:
- Step 1: Setting up a redundant database/caching layer
  - Includes: setting up MySQL with drbd and heartbeat, installing memcached, testing failover
- Step 2: Communication between web nodes and the database/caching layer
  - Includes: configuring haproxy, testing failover
- Step 3: Configuring LVS-TUN and monitoring of web service nodes
  - Includes: ldirectord and heartbeat installation on the load balancers, tunnel configuration on web nodes
- Step 4: Wrapping up
  - Includes: security tightening and final adjustments
What's the total cost?
Right now, I'm hosting this configuration with Slicehost with the following setup:
- load balancers: two 256MB instances (2 x $20/month)
- web nodes: two 1024MB instances (2 x $70/month)
- database nodes: two 512MB instances (2 x $38/month)
That adds up to $256 per month for the entire configuration at Slicehost. That price also includes 2.1TB of public bandwidth (since the bandwidth is pooled between instances). The only large consumers of bandwidth are the web nodes since they send out a lot of traffic. The load balancers simply receive requests on the public interface and shuttle them to the web nodes over the private network. The database servers would only talk to the public network for package updates.
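The monthly total is easy to re-run for other instance sizes. Here is a small Python sketch of the arithmetic using the Slicehost prices quoted above; the variable and function names are just for illustration.

```python
from decimal import Decimal

# (role, instance count, USD per instance per month), as quoted above.
SLICEHOST_INSTANCES = [
    ("load balancers", 2, Decimal("20")),
    ("web nodes", 2, Decimal("70")),
    ("database nodes", 2, Decimal("38")),
]


def monthly_total(instances) -> Decimal:
    """Sum count * price across every group of instances."""
    return sum((count * price for _, count, price in instances), Decimal("0"))


if __name__ == "__main__":
    for role, count, price in SLICEHOST_INSTANCES:
        print(f"{role}: {count} x ${price} = ${count * price}")
    print(f"total: ${monthly_total(SLICEHOST_INSTANCES)} per month")
```

Swap in your own provider's prices and instance counts to compare configurations before you commit to one.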
If you wanted to host the same configuration with Rackspace's Cloud Servers, you could do it for as little as $153.30 per month, but your bandwidth would be billed at the utility rates. For low traffic sites, this may be the better priced option.

What kinds of failures did you get through with this? I'm wondering whether you're leaving yourself open to Slicehost system bugs.
Stephan
Stephan -
Could you elaborate more about what you're after? I'm not sure how to answer your question.
So, you're lucky and your lb or web or db nodes are both on one real host. This host now goes down...so how useful was your load balancing, now?
morphium
morphium - That is a possible scenario. However, as the guide states, "This is not the only way to configure a redundant cloud environment". You can do lots of things with HA, including geographical failover. Host issues aside, you could just as easily have some sort of DNS failure. The site would go down but your servers would remain online.
With that in mind you need to design your infrastructure to fit your requirements. If that means ensuring that your site remains online in the event of multiple host failures then that is something you need to design for which may require more servers put online, or across multiple providers. This guide is a great "how to" for people looking to build in the cloud, but is not the solution for everyone.
morphium -
That's a good point that most folks forget. You'd also have to be concerned if all of your instances are in the same datacenter or even the same geographical region.
When it really comes down to it, you're always going to have failure points. You just have to decide how much money and time you want to spend to eliminate each one.
Along with what morphium said..
If I ask support, will they allow me to have this setup in a fashion that would make it more redundant? Could lb1, web1, and db1 be in dfw and lb2, web2, and db2 be in ord? Would this setup still work if that were the case? Thanks for this guide, it is very handy.
Errr -
You would need a provider which allows you to have the same IP addresses in more than one datacenter. That's a bit tricky to configure network-wise and I haven't seen a cloud hosting provider yet that offers it.
Hey,
Thanks for the guide, lots of useful information. Something I hadn't thought about doing before was load balancing memcached... interesting! Would save the hassle of specifying all IPs in app config (and changing that with scaling).
I have a question though; I normally use HAProxy for web backend LB. I see you use it for MySQL and memcached but then use LVS for your web lb. What's the general reasoning behind this choice?
@Major Hayden: there's a Dutch hosting provider (http://www.nucleus.be/en/) that can do that. Physically located servers in different datacenters, but because of their network set-up you can share IPs amongst them.
If you're stretching the limits for HA, you should also keep in mind redundancy in:
- DNS layer
- Network layer (no single cable or single switch to act as SPOF)
- Power redundancy (usually not an issue in datacenters)
- Knowledge redundancy (if only one person knows how your HA set-up works, and he drops dead, you're screwed.)
- IP layer (consider using IPs in different subnets to avoid BGP/ISP failures when one of their /24 announcements fails)
- Storage layer (since storage will become the heart and soul of your operation)
Matti -
That's good to know! I'll have to check out their offerings soon.
Major Hayden, here's a bit more:
Your redundant system may fail if Slicehost has a problem that affects all of your virtual machines. Such problems may arise in the Slicehost network or the Slicehost software. Presumably these are common elements of your redundant system.
Extreme example: the Slicehost billing system has a bug and your account is marked as overdue. All your VMs are deactivated.
My first question: did your system get your application smoothly through any problems, and if yes, which problems?
Second question: what do you think about the Slicehost exposure?
Stephan
Stephan -
You're totally correct - hosting with one provider does leave you open to risk. Even if the datacenter(s) is/are running up to speed and your configuration is perfect, an error on the part of the hosting provider could bring it all down. It really comes down to how much time and money you want to spend to make your application redundant.
Is that what you were asking?
Yes, good thanks!
Did your setup handle any outages? What kind?
Stephan
Stephan -
I've had a couple of failures so far. The hardware node that ran my web2 slice failed and had to be replaced. I was able to run on only the web1 server until web2 was back in the rotation. I also ran into a peculiar problem where connections were being held open between the web nodes and the database nodes, but I was able to correct that pretty quickly.
Following your articles on how to build redundant cloud hosting with Rackspace Cloud Servers, I ran some tests on glusterfs performance. The bandwidth on the internal network (ServiceNet) between cloud servers is limited.
http://cloudservers.rackspacecloud.com/index.php/Frequently_Asked_Questions#Is_there_a_throughput_limit_on_my_server.27s_network_interface_card.3F
This is a serious limit for technologies like DRBD, GlusterFS, etc. Have you found a solution to this problem?
DRBD with virtual machines...
Rackspace does not officially support it because they don't know how separate partitions will play with resizing.
Any thoughts on this would be highly appreciated.
Serg
morphium -
For example, Rackspace places your instances on different physical servers by default. Either way, you need to check that with your hosting company.
Serg
Opening up a networked file share over the Rackspace public cloud - does it not need encryption?