One month with GlusterFS in production

As many of you might have noticed from my previous GlusterFS blog post and my various tweets, I've been working with GlusterFS in production for my personal hosting needs for just over a month. I've also been learning quite a bit from some of the folks in the #gluster channel on Freenode. On a few occasions I've even been able to help out with some configuration problems from other users.

There has been quite a bit of interest in GlusterFS as of late and I've been inundated with questions from coworkers, other system administrators and developers. Most folks want to know about its reliability and performance in demanding production environments. I'll try to do my best to cover the big points in this post.

First off, here's how I'm using it in production: I have two web nodes that keep content in sync for various web sites. They each run a GlusterFS server instance and they also mount their GlusterFS share. I'm using the replicate translator to keep both web nodes in sync with client-side replication.
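For illustration only, a client volume file for a two-node replicated setup like this could look roughly like the output of the Python sketch below. The volume names (web1, web2, mirror), the addresses, and the brick name are made-up placeholders and are not taken from the actual configuration; treat the result as a starting point to compare against the documentation for your GlusterFS version.

```python
def render_client_volfile(servers, brick="brick"):
    """Render a client volume file that replicates across the given servers.

    `servers` maps a (hypothetical) client volume name to the host that
    exports the brick. `brick` is the (hypothetical) exported volume name.
    """
    lines = []
    for name, host in servers.items():
        # One protocol/client volume per server node.
        lines += [
            f"volume {name}",
            "  type protocol/client",
            "  option transport-type tcp",
            f"  option remote-host {host}",
            f"  option remote-subvolume {brick}",
            "end-volume",
            "",
        ]
    # The replicate translator sits on top of the client volumes.
    # The order of the subvolumes follows the order of `servers`.
    lines += [
        "volume mirror",
        "  type cluster/replicate",
        f"  subvolumes {' '.join(servers)}",
        "end-volume",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    # Placeholder addresses from the documentation range 192.0.2.0/24.
    print(render_client_volfile({"web1": "192.0.2.11", "web2": "192.0.2.12"}))
```

Because both web nodes are also clients, each node would mount the share using a file like this one.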

Here are my impressions after a month:

I/O speed is often tied heavily to network throughput
This one may seem obvious, but it doesn't hold in every environment. If you deal with a lot of small files like I do, a 40 Mbit/sec link between the Xen guests is plenty. Adding extra throughput didn't improve performance on my servers. However, if you regularly wrangle large files, you may want to consider higher-throughput links between your servers. I was able to push just under 900 Mbit/sec by using dd to create a large file within a GlusterFS mount.
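If you'd rather script that kind of large-file test than run dd by hand, a rough Python equivalent might look like the sketch below. The function name, the default sizes, and the example path are assumptions for illustration, and the interpreter adds some overhead, so expect the number to be approximate.

```python
import os
import sys
import time


def measure_write_throughput(path, total_mb=64, block_kb=1024):
    """Write total_mb of zeros to path and return throughput in Mbit/sec."""
    block = b"\0" * (block_kb * 1024)
    blocks = (total_mb * 1024) // block_kb
    start = time.monotonic()
    with open(path, "wb") as f:
        for _ in range(blocks):
            f.write(block)
        f.flush()
        # Ask the OS to push the data to the underlying storage
        # before the clock is stopped.
        os.fsync(f.fileno())
    elapsed = max(time.monotonic() - start, 1e-9)  # avoid division by zero
    written_bits = blocks * len(block) * 8
    return written_bits / elapsed / 1_000_000


if __name__ == "__main__" and len(sys.argv) > 1:
    # Example (hypothetical path inside a GlusterFS mount):
    #   python throughput.py /mnt/glusterfs/throughput.test
    print(f"{measure_write_throughput(sys.argv[1]):.1f} Mbit/sec")
```

Run it once against a path inside the GlusterFS mount and once against local disk to see how much of the difference comes from the network.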

Network and I/O latency are big factors for small file performance
If you have a busy network and the latency creeps up from time to time, you'll find that your small file performance will drop significantly (especially with the replicate translator). Without getting too nerdy (you're welcome to read the technical document on replication), replication is an intensive process. When a file is accessed, the client goes around to each server node to ensure that it not only has a copy of the file being read, but that it has the correct copy. If a server didn't save a copy of a file (due to disk failure or the server being offline when the file was written), it has to be synced across the network from one of the good nodes.

When you write files on replicated servers, the client has to roll through the same process first. Once that's done, it has to lock the file, write to the change log, then do the write operation, drop the change log entries, and then unlock the file. All of those operations must be done on all of the servers. High latency networks will wreak havoc on this process and cause it to take longer than it should.
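To get a feel for why latency hurts so much, here is a deliberately crude back-of-the-envelope model in Python. It assumes that each step described above costs one network round trip and that the client talks to all servers in parallel, so every step waits on the slowest server. Both assumptions are simplifications for illustration, not a description of GlusterFS internals, and disk time is ignored entirely.

```python
# Steps a replicated write goes through, as described above.
WRITE_STEPS = [
    "check replicas",
    "lock file",
    "write change log",
    "write data",
    "clear change log",
    "unlock file",
]


def estimate_write_latency_ms(server_rtts_ms, steps=WRITE_STEPS):
    """Estimate the network time for one replicated write, in milliseconds.

    Assumes one round trip per step and parallel requests to all servers,
    so each step takes as long as the slowest server's round trip.
    """
    slowest = max(server_rtts_ms)
    return len(steps) * slowest


if __name__ == "__main__":
    for rtts in ([0.2, 0.2], [0.2, 5.0]):
        estimate = estimate_write_latency_ms(rtts)
        print(f"RTTs {rtts} ms -> about {estimate:.1f} ms per write")
```

Under this model, two servers at 0.2 ms each cost about 1.2 ms of network time per write, while a single server creeping up to 5 ms pushes every write to about 30 ms. Multiply that by thousands of small files and the slowdown becomes very visible.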

Even if you have a fast, low-latency network between your servers, slow disks can still be a problem. If the client is waiting on the server nodes' disks to write data, read and write performance will suffer. I've tested this in environments with fast networks and very busy RAID arrays. Even when the network was very underutilized, slow disks could cut performance drastically.

Monitoring GlusterFS isn't easy
When the client has communication problems with the server nodes, some weird things can happen. I've seen situations where the client loses connections to the servers (see the next section on reliability) and the client mount simply hangs. In other situations, the client was knocked offline entirely and the process was missing from the process tree by the time I logged in. Your monitoring will need to ensure that the mount is active and is responding in a timely fashion.

There's a handy script which allows you to monitor GlusterFS mounts via nagios that Ian Rogers put together. Also, you can get some historical data with acrollet's munin-glusterfs plugin.
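For a quick homegrown check alongside those tools, something like the Python sketch below covers both requirements: the path must be a mount point, and it must answer within a deadline. The function name, the default timeout, and the example path are assumptions. The probe runs in a daemon thread so that a hung mount cannot hang the check itself.

```python
import os
import sys
import threading


def check_mount(path, timeout=5.0):
    """Return (ok, message) describing whether path is a responsive mount."""
    result = {}

    def probe():
        try:
            result["is_mount"] = os.path.ismount(path)
            os.listdir(path)  # forces real I/O against the mounted filesystem
        except OSError as exc:
            result["error"] = str(exc)

    # A daemon thread is abandoned at exit if the mount never answers.
    worker = threading.Thread(target=probe, daemon=True)
    worker.start()
    worker.join(timeout)

    if worker.is_alive():
        return False, f"{path} did not respond within {timeout}s"
    if "error" in result:
        return False, f"{path} error: {result['error']}"
    if not result.get("is_mount"):
        return False, f"{path} is not a mount point"
    return True, f"{path} is mounted and responding"


if __name__ == "__main__" and len(sys.argv) > 1:
    # Example (hypothetical mount point): python check_mount.py /mnt/glusterfs
    ok, message = check_mount(sys.argv[1])
    print(("OK: " if ok else "CRITICAL: ") + message)
    # Nagios plugin convention: 0 means OK, 2 means CRITICAL.
    sys.exit(0 if ok else 2)
```

The exit codes follow the usual Nagios plugin convention, so the script can be dropped into an existing check without extra wrapping.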

GlusterFS 3.x is pretty reliable
When I first started working with GlusterFS, I was using a version from the 2.x tree. The Fedora package maintainer hadn't updated the package in quite some time, but I figured it should work well enough for my needs. I found that the small file performance was lacking and the nodes often had communication issues when many files were being accessed or written simultaneously. This improved when I built my own RPMs of 3.0.4 (and later 3.0.5) and began using those instead.

I did some failure testing by hard cycling the server and client nodes and found some interesting results. First off, abruptly pulling clients had no effect on the other clients or the server nodes. The connection eventually timed out and the servers logged the timeout as expected.

Abruptly pulling servers led to some mixed results. In the 2.x branch, I saw client hangs and timeouts when I abruptly removed a server. This appears to be mostly corrected in the 3.x branch. If you're using replicate, it's important to keep in mind that the first server volume listed in your client's volume file is the one that will be coordinating the file and directory locking. Should that one fall offline quickly, you'll see a hiccup in performance for a brief moment and the next server will be used for coordinating the locking. When your original server comes back up, the locking coordination will shift back.
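The failover behavior described above boils down to "use the first server in volume-file order that is currently reachable." The tiny Python sketch below only illustrates that rule; the function and server names are hypothetical, and the real logic lives inside the replicate translator.

```python
def lock_coordinator(subvolumes, online):
    """Return the first subvolume, in volume-file order, that is online.

    `subvolumes` is the ordered list from the client's volume file and
    `online` is the set of servers currently reachable. Returns None if
    no server is reachable.
    """
    for name in subvolumes:
        if name in online:
            return name
    return None


if __name__ == "__main__":
    order = ["web1", "web2"]  # hypothetical server names
    print(lock_coordinator(order, {"web1", "web2"}))  # both servers up
    print(lock_coordinator(order, {"web2"}))          # first server down
    print(lock_coordinator(order, {"web1", "web2"}))  # first server back
```

With both servers up the first one coordinates, the second takes over while the first is down, and coordination shifts back once the first returns.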

Conclusion
I'm really impressed with how much GlusterFS can do with the simplicity of how it operates. Sure, you can get better performance and more features (sometimes) from something like Lustre or GFS2, but the amount of work required to stand up that kind of cluster isn't trivial. GlusterFS really only requires that your kernel have FUSE support (it's been in mainline kernels since 2.6.14).

There are some things that GlusterFS really needs in order to succeed:

  • Documentation - The current documentation is often out of date and confusing. I've even found instances where the documentation contradicts itself. While there are some good technical documents about the design of some translators, the Gluster developers really ought to do some more work there.
  • Statistics gathering - It's very difficult to find out what GlusterFS is doing and where it can be optimized. Profiling your environment to find your bottlenecks is nearly impossible with the 2.x and 3.x branches. It doesn't make it easier when some of the performance translators actually decrease performance.
  • Community involvement - This ties back into the documentation part a little, but it would be nice to see more participation from Gluster employees on IRC and via the mailing lists. They're a little better with mailing list responses than other companies I've seen, but there is still room for improvement.

If you're considering GlusterFS for your servers but you still have more questions, feel free to leave a comment or find me on Freenode (I'm 'rackerhacker').

Printed from: http://rackerhacker.com/2010/08/11/one-month-with-glusterfs-in-production/ .
© Major Hayden 2012.

12 Comments

  • Twirrim says:

    Interesting read, thanks. GlusterFS is something that's been on my radar recently. I'm not sure at the moment if I've got a particular use case for it, but it's certainly something I want to keep in mind.

    Was there much manual intervention required in the case of server reboots, or does GlusterFS silently and easily pick up where it failed?

    What do you see as the primary use for GlusterFS?

  • Major Hayden says:

    Twirrim -

    As long as you have your GlusterFS servers configured to automatically start at boot time, there shouldn't be any intervention required. With the replicate translator, there is the chance that a split-brain situation could occur. I haven't seen it myself, but correcting it is a relatively simple and quick process.

  • Hi RackerHacker,

    We are definitely working on the "Monitoring GlusterFS isn't easy" part. We are also working on making it easier to get stats at runtime.

    Will catch you on IRC :-)

    Regards,
    Amar Tumballi
    (bulde on #gluster)

  • Major Hayden says:

    Amar -

    Thanks for the comment. Statistics would be extremely handy. It'd be nice to know what the limiting factor is in a particular GlusterFS configuration. I've seen some differences in performance on different systems and being able to find the bottleneck would be a huge help.

  • Tom says:

    Hi RackerHacker,

    I have a similar setup (2 web nodes with GlusterFS, both are clients and servers), running mainly Joomla CMS with Apache on them. My biggest problem after the initial setup is that the web server takes a very long time to send headers (about 1s), as seen using http://site-perf.com. Have you encountered such behavior with your setup?

  • Major Hayden says:

    Hey Tom,

    I've seen this delay as well, and it's a bit frustrating to fix. I was able to cut down on the delay in WordPress by storing more of my cache data in memcache rather than on the disk as I had traditionally done. Adjusting the performance translators in the GlusterFS client volume file also helped, but only by a small amount.

  • Hey Tom, thanks for the comments on GlusterFS, extremely valuable to find information about use in a production environment.

    We're currently testing Gluster and have found its throughput to be terrible. Not sure what is going on, but we're seeing 5 to 10 MB/s writes, and reads well over 2 GB/s. Not sure what is causing such an extreme fluctuation; perhaps you can shed some light. Feel free to email directly, we might be interested in seeking your help.

  • Alex says:

    In the documentation,
    "How can I improve the performance of reading many small files?
    Use the NFS client. For reading many small files, i.e. PHP web serving, the NFS client will perform much better."

    Any idea how to do this? We are experiencing serious web server load because of this problem.

  • Samba Kolusu says:

    Have you tried XtreemFS? It seems to solve similar needs, but everything is tied together rather than split into separate tools.

    The website claims that XtreemFS has been primarily designed to work with PC-grade hardware distributed across the internet. Replication is built into the filesystem, unlike in GlusterFS, where we handle copying the files using rsync (in the case of geo, especially). In the latest version (not yet released officially), they claim that they are supporting master-master replication of the entire mounted partition.

    Read performance is pretty good, almost the same as native filesystems like ext3. Write performance seems to be as bad as GlusterFS's. Perhaps this is common to any distributed filesystem.

    Do let me know what you think about this option...
    Thanks and Regards,
    Samba

  • Fred says:

    Hi Racker Hacker,

    We have been using GlusterFS version 2.08 for a couple of years with good results until a few weeks ago. Our problems started occurring when the concurrent traffic to our PHP site increased to about 200 concurrent users on each front end server. At this point the performance of the site was so bad that page requests would just time out and the users would be unable to use the system. Our system has the following characteristics.
    - Load Balancer.

    - Two front end machines with Xeon Quad Core X3360 2.83 GHz, 4 GB of RAM and 7200 RPM disks in a RAID 1 configuration. Each of these machines runs an Apache server. GlusterFS was installed on both front end machines and was used to keep the files submitted by the users, as well as the website code and its contents (PHP, images, css, js, html), synchronized between the two machines.

    - Database server with 2x Xeon Quad Core E5410 2.33 GHz, 8 GB of RAM and disks of 10 k RPMs in RAID 10 configuration. This server is running an Oracle database.

    In order to replicate the problem faced by the users, I used Apache's ab tool to simulate 200 concurrent requests against one of the front end servers. Using top and iostat, I saw that the I/O wait time went through the roof and that GlusterFS was using 30-40% of the server's CPU. Once 200 concurrent requests were being made, the server became unable to serve PHP pages: users waited so long for each page that the requests timed out.

    Based on these results I disabled GlusterFS and now each server is able to serve over 400 concurrent requests at a time (I haven't tested it using more requests but I am pretty sure it wouldn't have any problem serving them). However, based on what I have read about GlusterFS being able to handle cloud infrastructures that contain petabytes of data, I am pretty sure that I must be missing something.

    Can you please help me with this?

    Thanks and Regards,
    Fred

  • Buzz says:

    I've been fighting PHP on Gluster for a while now, and just came across this article on increasing io-threads for improved performance. I'm looking at it now and thought you may find it interesting.

  • Buzz says:

    Also, in an utter *facepalm* moment during some internal soak testing, I found that setting noatime and using the NFS mount with Gluster 3.2 yielded a 48x performance improvement on a PHP webapp, compared to using the Gluster native client with default mount options. I have it undergoing a 3-hour soak test now; so far so good.

    Average response time under load: 0.78s (down from 21.96s, a 28.15x improvement). Still not as fast as I would like, and I'm still working out the nuances.

