A failed experiment with GlusterFS
GlusterFS is a clustered file system you can use to share content across different machines, something NFS can do as well. The difference is that failover with plain NFS is hard.
In GlusterFS, you can add two servers, each contributing a storage directory (a brick, in Gluster's terminology), and create your volume over them as a replica: Gluster replicates all data to both servers. GlusterFS can also advertise volumes as NFS shares, but I didn't use that, for the same basic reason – failover.
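As a rough sketch of what that setup involves (hostnames, volume name, and brick paths here are made up; the commands are the standard 3.x-era Gluster CLI):

```shell
# On server1: add the second server to the trusted pool
gluster peer probe server2

# Create a two-way replicated volume from one brick on each server
gluster volume create homes replica 2 \
    server1:/data/brick1/homes server2:/data/brick1/homes

# Bring the volume online
gluster volume start homes
```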
The native GlusterFS client can fail over automatically when one of the servers dies. When both servers are up, it reads from them in parallel for better throughput. When the dead server comes back up, all data changes are synced between the two by a process called healing.
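For reference, mounting with the native client and checking heal status looks roughly like this (server, volume, and mount point names are hypothetical):

```shell
# Mount the volume with the native FUSE client;
# failover between the replica servers is handled client-side
mount -t glusterfs server1:/homes /home

# List entries the self-heal daemon still has to reconcile
gluster volume heal homes info
```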
GlusterFS as such is a great product and the developers have put a lot of work into it. But it has plenty of issues, and it's probably a bad idea to use it without Red Hat support. Community support is thin because the number of community users is low, and the issues are often not reliably reproducible. Clustering is a very hard thing to do, especially at high IOPS, and GlusterFS gets it right to a great extent.
Some details about the environment where I used GlusterFS 3.4.4:
- Two dedicated servers: i7-3770, 32 GB RAM and RAID10 over 4 disks.
- 1 Gbit line between the two, both of them located in the same DC. The connection was common for internal network as well as the Internet.
- Gentoo Linux
- Both servers acting as Gluster clients as well.
- Gluster volume for home directories, containing lots of websites (mostly WordPress, some Joomla, etc.)
- One of the servers was also a database server (MySQL, PostgreSQL) in addition to being a serving node (i.e. PHP, Apache).
Before trying Gluster, I did read recommendations that the Gluster server should run on its own machines, and that no other services should be run on them. But I gave it a shot anyway, and I don't think running it alongside other services was the root cause of the problem.
Since one of the servers was also the database server, I configured my load balancer, HAProxy, to forward 60% of the traffic to the node without the database and 40% to the node with it, so that the database gets enough CPU for itself. Database response time matters for any website. All WordPress sites were configured with W3 Total Cache, and some of them were high-traffic.
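A minimal HAProxy backend for that 60/40 split might look like this (server names and addresses are illustrative; weights are relative, so 60:40 is the same ratio as 3:2):

```
backend web
    balance roundrobin
    # node1 has no database, so it takes the larger share
    server node1 10.0.0.1:80 check weight 60
    # node2 also runs MySQL/PostgreSQL, so it gets less web traffic
    server node2 10.0.0.2:80 check weight 40
```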
The IO throughput of Gluster was good when the volumes were mounted on each of the nodes. But for some reason there were frequent split-brains between the two bricks. When a split-brain occurs and Gluster's self-heal daemon isn't able to heal it, reads of the affected file return an input/output error on the client. This used to happen only on one of the nodes, and I never found out why.
The weirder part is that split-brain occurred on random files and random times. Sometimes it would be .htaccess or wp-config.php, and I can vouch that neither of those was modified on any node when the split-brain occurred.
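For reference, the files Gluster considers split-brained can be listed from either server (using a hypothetical volume name `homes`):

```shell
# List files the cluster currently considers split-brained;
# these are the ones that return EIO on the client
gluster volume heal homes info split-brain
```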
This would cause Apache to throw HTTP Error 500 because the file could not be read by PHP or Apache, ultimately causing trouble for visitors. Whenever traffic hit the second node, people would see random 500s, or some assets would fail to load (again due to IO errors).
Another issue was the servers crashing due to excessive CPU load. On a machine with 8 CPUs (4 physical cores plus 4 logical via Intel Hyper-Threading), the load average shouldn't cross 8.00, or maybe 10.00 at a stretch. The initial data sync between the two servers pushed the load to 50–60. The server would simply crash and require a hard reboot to come back, and then the same thing would happen again.
In short, dealing with Gluster was a nightmare for me. I tried the Gluster mailing list, and someone from Red Hat replied, but the reply didn't contain a solution. It would be apt to say that Gluster took more hair off my head during this month of experimenting than I lost in the whole past year.
I have now moved off Gluster to a traditional NFS mount. One server holds all the data, and it's simply NFS-mounted on the other node. This seems to perform far better than Gluster. The IO performance I'm getting out of NFS is better, and CPU usage is lower.
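The replacement setup amounts to something like this (hostnames, paths, and export options are illustrative, not my exact configuration):

```shell
# On the data server, export /home to the other node
# via a line in /etc/exports such as:
#   /home  node2(rw,sync,no_subtree_check)
# then reload the export table:
exportfs -ra

# On the other node, mount it over NFS
mount -t nfs server1:/home /home
```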
But of course, NFS doesn't give me Gluster's main advantage: failover. It seems NFS can get failover too if combined with DRBD across two servers. I'll give that a thought next time, and I suspect it would perform better than Gluster. Or maybe not. That's a future experiment.
So, TL;DR: don't use Gluster without Red Hat support.
One question: what fstab line do you use? (In case access time, a.k.a. atime, was a factor, etc.)
On Gentoo, fstab is not required for mounting Gluster volumes; you can add the mount as a service. Anyway, I didn't know that Gluster supported the atime option, because it isn't mentioned in the manual. I had mounted it with acl though.
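For readers who do want an fstab entry, a Gluster mount with the acl option would look roughly like this (volume name and mount point are hypothetical):

```
# /etc/fstab: mount a Gluster volume with ACL support at boot;
# _netdev delays the mount until the network is up
server1:/homes  /home  glusterfs  defaults,acl,_netdev  0 0
```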
I think you answered your own question: “Before trying Gluster, I did read recommendations that the Gluster server should run on its own machines… I don't think running it alongside other services was the root cause of the problem.”
Why do you say that? There’s a reason they recommend against it. If you run into contention between client and server processes, there’s nothing to prioritize the server processes to prevent timeout or other IO issues. That could easily lead to the split brain situation you describe.
You added an interesting point of view to those recommendations. This angle isn’t mentioned anywhere.
This might be quite possible.
You haven’t mentioned which filesystem you used on the individual bricks. Your split-brain problem is because the underlying filesystem cannot handle locks correctly during concurrent access (Gluster is a user-space filesystem and doesn’t handle locking).
Your only options are GFS or OCFS if you want to avoid locks.
This is why you failed:
1. “Two dedicated servers: i7-3770, 32 GB RAM and RAID10 over 4 disks.” and the resulting “The weirder part is that split-brain occurred on random files and random times”
2. “Gentoo Linux”
3. Storing source code for hosting (PHP)
1. GlusterFS is nice at large scale; 2 servers aren’t enough to show its power. We have a 12-server cluster (4 distributed x 3 replica), and even this is not enough to show its real power, but it’s plenty for our needs. Second problem: just DO NOT use RAID. Really. We’ve had a lot of problems with IOPS when using underlying RAID; GlusterFS handles JBODs very nicely. We had 100% IO utilisation when using RAID6, then tried RAID5 (8 HDDs) – still the same. Then we dropped RAID, and now with JBODs every disk stays below 10% IO utilisation all the time. About split-brain: you should use a 3-node cluster! It’s possible to configure GlusterFS to run as 2-node, but it’ll have problems anyway. Just don’t do that.
2. Really – use Red Hat packages, even on CentOS. We had problems even with the CentOS packages of GlusterFS, so we recompiled Red Hat’s SRPMs and now it works nicely.
3. “The IO performance I’m getting out of NFS is better”
And yeah, you used GlusterFS for hosting PHP sites. It’s not good for that, because it’s not good at reading many small files one after another. It’s very nice for static file hosting though – say you have a site with photo albums, and you put those photos on GlusterFS. Just don’t do file-by-file reading like PHP does when it includes multiple files 😉
We’ve been using GlusterFS since 3.2. Now on 3.7 (yep, still not upgraded to 3.9).
And yeah, I know this post is now almost 3 years old, but it’s still accurate and easy to find via Google 😉
Insightful. Will keep it in mind when requiring something similar.
Hey, thanks for posting your experiences. I was considering moving from NFS to GlusterFS.
Thanks for reading the article. But keep in mind that my experience here is 3 years old – I worked with this in 2014. And a few others have pointed out flaws in my setup (read the other comments).