One of the servers I run has FreeBSD 10. It hosts a high-traffic Magento site. Magento, being a very heavy application, requires a dedicated server. The site’s performance was very bad when it was hosted on a VPS, though perhaps that depends on the provider or needs tuning. It’s not my site; my task was to move it to a dedicated server so I wouldn’t have to consider all that.
As someone new to FreeBSD, I try to stick to the tools and utilities that FreeBSD itself provides and not rely on those that come from other BSDs. This rule is quite flexible, but I can’t think of any tool from another BSD that I’m using right now. So, naturally, for the firewall I chose IPFW, which is FreeBSD’s own firewall. The other firewalls supported by FreeBSD are PF (which comes from OpenBSD) and IPFilter (a cross-platform firewall that is also used on NetBSD).
Recently, my customer was complaining about mails not going through properly, and about site slowdowns. I couldn’t see the issue on my connection, so I told him it could be an ISP-level routing issue (which happens quite often here in India) and that such issues usually fix themselves after some time. But the problem didn’t go away even after a week, and then Pingdom started complaining about an unreachable website. I could still ssh into the server, and the site was reachable from my end.
First I ran netstat on the server and, to my surprise, there were some 4000 TCP connections to port 80 in the FIN_WAIT_2 state. A connection sits in FIN_WAIT_2 when one side (in this case, the server) has sent its FIN to close the connection and had it acknowledged, but the other side has not yet sent its own FIN. A TCP connection must be closed properly from both sides.
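To make that netstat check repeatable, here is a minimal Python sketch that tallies TCP states for one local port. It assumes the column layout that FreeBSD's `netstat -an -p tcp` prints (local address in the fourth column with the port after the last dot, state in the sixth). The sample text, the addresses in it, and the function name `count_states` are made up for illustration and are not from the actual server.

```python
from collections import Counter

# Hypothetical sample in the layout of FreeBSD's `netstat -an -p tcp`.
SAMPLE = """\
Active Internet connections (including servers)
Proto Recv-Q Send-Q Local Address          Foreign Address        (state)
tcp4       0      0 192.0.2.10.80          198.51.100.7.51234     FIN_WAIT_2
tcp4       0      0 192.0.2.10.80          198.51.100.8.40022     FIN_WAIT_2
tcp4       0      0 192.0.2.10.80          203.0.113.5.60100      ESTABLISHED
tcp4       0      0 192.0.2.10.22          203.0.113.9.50514      ESTABLISHED
tcp4       0      0 *.80                   *.*                    LISTEN
"""


def count_states(netstat_output, local_port):
    """Count TCP connection states for sockets bound to local_port."""
    counts = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Skip headers and anything that is not a complete tcp4/tcp6 row.
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        # FreeBSD separates address and port with a dot: 192.0.2.10.80
        port = fields[3].rsplit(".", 1)[-1]
        if port == str(local_port):
            counts[fields[5]] += 1
    return counts


if __name__ == "__main__":
    for state, n in count_states(SAMPLE, 80).most_common():
        print(state, n)
```

On a real server, the output of `netstat -an -p tcp` could be passed in place of `SAMPLE`; a large FIN_WAIT_2 count on port 80 is the symptom described above.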
I was clueless about what to do, but then I checked the kernel log (dmesg) and, to my shock, found thousands of lines containing this message:
ipfw: add_dyn_rule: Cannot allocate rule
By now, the site had almost gone down on the customer’s side, so he was panicking. I searched Google for this issue, but there wasn’t any documentation; the results only linked to source code in the BSDs. In the end, a reboot fixed it. Later I learned the reason from my friend on Twitter, @FreeBSDHelp: IPFW had hit its limit on dynamic rules. Stateful rules create one dynamic rule per connection, so it didn’t take long for me to link the number of FIN_WAIT_2 sockets to the rule limit being exceeded. The tunable for this limit is net.inet.ip.fw.dyn_max, but let’s address the root cause of the issue, the FIN_WAIT_2 sockets, because if I only raise the dynamic rule limit, it is quite likely that the server will hit the new limit again.
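To watch for this condition before the log fills up, here is a rough Python sketch that compares the current number of dynamic rules with the limit. It assumes the counter is exposed as `net.inet.ip.fw.dyn_count` alongside `net.inet.ip.fw.dyn_max` and that both can be read with `sysctl -n`. The 80% warning threshold and the helper names are arbitrary choices of mine, not anything IPFW prescribes.

```python
import subprocess


def read_sysctl(name):
    """Read an integer sysctl value via `sysctl -n` (assumes a FreeBSD host)."""
    out = subprocess.run(
        ["sysctl", "-n", name], capture_output=True, text=True, check=True
    )
    return int(out.stdout.strip())


def dyn_rule_usage(count, limit):
    """Return the fraction of the dynamic rule limit currently in use."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    return count / limit


def describe(count, limit, warn_at=0.8):
    """Build a one-line status; warn_at is an assumed threshold."""
    usage = dyn_rule_usage(count, limit)
    status = "WARNING" if usage >= warn_at else "ok"
    return f"{status}: {count}/{limit} dynamic rules in use ({usage:.0%})"


if __name__ == "__main__":
    # Only works where IPFW is loaded; the sysctl names are assumptions
    # based on the dyn_max tunable mentioned above.
    count = read_sysctl("net.inet.ip.fw.dyn_count")
    limit = read_sysctl("net.inet.ip.fw.dyn_max")
    print(describe(count, limit))
```

Run from cron, something like this could have flagged the problem while the site was still reachable.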
I found two tunables in this regard, net.inet.tcp.fast_finwait2_recycle and net.inet.tcp.finwait2_timeout. The default for the first one is 0 and for the second one is 60000 (60 seconds). What this means is that, once the server has sent a FIN, it will wait for up to 1 minute before discarding the socket. One minute is a long time for a server handling heavy traffic, especially when modern browsers keep extra connections open to websites to improve user experience. I have heard that enabling the former option causes some problems behind a NAT, but I’m not sure what exactly they are, and since the server is not behind a NAT I don’t have a reason to care. So I enabled it and reduced the timeout to 15000 (15 seconds), which is the timeout configured in Apache’s KeepAlive settings as well.
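For reference, here is a small Python sketch that writes out those two settings in the `name=value` form used by `/etc/sysctl.conf`, plus the matching `sysctl` commands for applying them on a running system. The values are the ones described above; the helper names are hypothetical, and the script only prints text, it changes nothing by itself.

```python
# Values taken from the paragraph above: enable fast recycling and
# lower the FIN_WAIT_2 timeout from 60000 ms to 15000 ms.
TUNABLES = {
    "net.inet.tcp.fast_finwait2_recycle": 1,
    "net.inet.tcp.finwait2_timeout": 15000,
}


def render_sysctl_conf(tunables):
    """Render name=value lines suitable for appending to /etc/sysctl.conf."""
    return "".join(f"{name}={value}\n" for name, value in tunables.items())


def runtime_commands(tunables):
    """Return argv lists that would apply the same values immediately."""
    return [["sysctl", f"{name}={value}"] for name, value in tunables.items()]


if __name__ == "__main__":
    print(render_sysctl_conf(TUNABLES), end="")
    for argv in runtime_commands(TUNABLES):
        print(" ".join(argv))
```

The lines in `/etc/sysctl.conf` make the change survive a reboot, while the `sysctl` commands (run as root) apply it right away.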
So far, the issue hasn’t appeared again, but I’m still observing it. This caused a lot of frustration because there was no documentation on this error message and the message itself doesn’t give much information. I have filed a bug about this so that the documentation gets improved. Tomorrow you could be a victim of this problem, so you should support that 😉