Maintain lead acid batteries regularly

Thursdays are usually maintenance days for the electricity supply company in my area, so there was a power cut for nearly the full day. Luckily, I have a UPS, which covers 8-9 hours. The lead acid battery in my UPS is about 3-4 years old. It is an unsealed battery, and these last long, really long, if maintained properly.

In the past I have had one such battery last for a decade before requiring a replacement.

Unsealed lead acid batteries require two important maintenance activities:

  1. Topping up distilled water every 6 months
  2. Applying petroleum jelly / grease on the terminals to prevent corrosion

Ubuntu 18.04 add e1000e Intel driver to dkms

Note: The compile process appears to be broken for driver version 3.8.4. So the following steps will not work for that version. This post will be updated when a suitable fix is found for the same.

Here’s a quick guide on how to add the Intel e1000e driver to DKMS (Dynamic Kernel Module Support) so that it gets installed / uninstalled automatically with future kernel updates and removals.

Download the driver from the Intel website: https://downloadcenter.intel.com/download/15817

At the time of writing, the e1000e version is 3.4.2.1, so the downloaded tarball is e1000e-3.4.2.1.tar.gz.

Extract it to /usr/src:

tar -xzf e1000e-3.4.2.1.tar.gz -C /usr/src

Create a dkms.conf in /usr/src/e1000e-3.4.2.1 with the following contents:

PACKAGE_NAME="e1000e"
PACKAGE_VERSION="3.4.2.1"
AUTOINSTALL=yes
MAKE[0]="make -C src/"
BUILT_MODULE_NAME="e1000e"
BUILT_MODULE_LOCATION="src/"
DEST_MODULE_LOCATION="/kernel/drivers/net/ethernet/intel/e1000e"

Next, we have to tell DKMS that such a module has been added and build it for each of the kernels we have on the system:

dkms add -m e1000e/3.4.2.1
for k in /boot/vmlinuz*; do
  dkms install -k ${k##*vmlinuz-} e1000e/3.4.2.1
done

Finally, reboot the system and the new module should be live.

Date range in a MariaDB query using the Sequence Engine

One of my applications needed a date-wise report of the items created on each day, with a zero count against the dates that had no entries.

The user selects a date range and the application must generate this report. It would not have been so easy had I not come across the MariaDB Sequence Storage Engine!

Sequences have long been a feature of databases like Oracle and PostgreSQL, but I had absolutely no idea they existed in MariaDB. I just came across the engine while browsing the MariaDB documentation.

Here’s a sample of my use case:

MariaDB [test]> create table items (id int unsigned primary key auto_increment, date_created datetime not null);
Query OK, 0 rows affected (0.061 sec)

MariaDB [test]> insert into items (date_created) values ('2019-01-01'), ('2019-01-05'), ('2019-01-06'), ('2019-01-06'), ('2019-01-01'), ('2019-01-10'), ('2019-01-09'), ('2019-01-09'), ('2019-01-09');
Query OK, 9 rows affected (0.032 sec)
Records: 9  Duplicates: 0  Warnings: 0

MariaDB [test]> select * from items;
+----+---------------------+
| id | date_created        |
+----+---------------------+
|  1 | 2019-01-01 00:00:00 |
|  2 | 2019-01-05 00:00:00 |
|  3 | 2019-01-06 00:00:00 |
|  4 | 2019-01-06 00:00:00 |
|  5 | 2019-01-01 00:00:00 |
|  6 | 2019-01-10 00:00:00 |
|  7 | 2019-01-09 00:00:00 |
|  8 | 2019-01-09 00:00:00 |
|  9 | 2019-01-09 00:00:00 |
+----+---------------------+
9 rows in set (0.001 sec)

MariaDB [test]> select date(date_created), count(id) from items group by date(date_created);
+--------------------+-----------+
| date(date_created) | count(id) |
+--------------------+-----------+
| 2019-01-01         |         2 |
| 2019-01-05         |         1 |
| 2019-01-06         |         2 |
| 2019-01-09         |         3 |
| 2019-01-10         |         1 |
+--------------------+-----------+
5 rows in set (0.001 sec)

MariaDB [test]> 

After a couple of attempts with the samples provided in the MariaDB documentation page, I managed to devise a query which provided me exactly what I needed, using SQL UNION:

MariaDB [test]> select dt, max(cnt) from ( select cast( date_add('2019-01-01', interval seq day) as date ) dt, 0 cnt from seq_0_to_11 union select cast( date(date_created) as date ) dt, count(id) cnt from items where date(date_created) between '2019-01-01' and '2019-01-11' group by date(date_created) ) t group by dt order by dt;
+------------+----------+
| dt         | max(cnt) |
+------------+----------+
| 2019-01-01 |        2 |
| 2019-01-02 |        0 |
| 2019-01-03 |        0 |
| 2019-01-04 |        0 |
| 2019-01-05 |        1 |
| 2019-01-06 |        2 |
| 2019-01-07 |        0 |
| 2019-01-08 |        0 |
| 2019-01-09 |        3 |
| 2019-01-10 |        1 |
| 2019-01-11 |        0 |
| 2019-01-12 |        0 |
+------------+----------+
12 rows in set (0.001 sec)

Yeah, that’s basically filling in zero values for the dates on which there were no entries. Can this be done using RIGHT JOIN? I tried, but couldn’t form a JOIN condition. If you know how, drop a comment!
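
For comparison, the same zero-filling can also be done in application code once the grouped counts have been fetched. The following is only a minimal sketch in Go, not what my application does: the names fillDays and dayCount are made up for this illustration, and the counts map stands in for the result of the GROUP BY query shown earlier.

```go
package main

import (
	"fmt"
	"time"
)

// dayLayout is Go's reference layout for a YYYY-MM-DD date.
const dayLayout = "2006-01-02"

// dayCount is one row of the report: a date and how many items it had.
type dayCount struct {
	Day   string
	Count int
}

// fillDays returns one row per day from start to end (inclusive), using 0
// for every day that is missing from counts.
func fillDays(start, end string, counts map[string]int) ([]dayCount, error) {
	from, err := time.Parse(dayLayout, start)
	if err != nil {
		return nil, err
	}
	to, err := time.Parse(dayLayout, end)
	if err != nil {
		return nil, err
	}

	var rows []dayCount
	for d := from; !d.After(to); d = d.AddDate(0, 0, 1) {
		key := d.Format(dayLayout)
		// A missing map key yields the zero value, which is the 0 we want.
		rows = append(rows, dayCount{Day: key, Count: counts[key]})
	}
	return rows, nil
}

func main() {
	// Counts as returned by the GROUP BY query on the sample items table.
	counts := map[string]int{
		"2019-01-01": 2, "2019-01-05": 1, "2019-01-06": 2,
		"2019-01-09": 3, "2019-01-10": 1,
	}
	rows, err := fillDays("2019-01-01", "2019-01-12", counts)
	if err != nil {
		panic(err)
	}
	for _, r := range rows {
		fmt.Println(r.Day, r.Count)
	}
}
```

Doing it in the database keeps the report a single query, which is why I preferred the Sequence engine; the sketch is just the fallback when no such engine is available.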

Multi-WAN DNS in pfSense

Update: I later figured out that there are many other places where pfSense restarts Unbound, so this is simply not worth the effort. I reversed the changes and moved Unbound to another box. pfSense now runs just the DNS Forwarder, which the Unbound server uses as its upstream.

Having multiple broadband connections at home, I have a pfSense box which takes care of load balancing and firewalling. pfSense is pretty good at almost everything, except for two things that annoyed me a lot. First, it restarts the DNS Resolver (Unbound) every time either of my WAN connections restarts (one of my ISPs restarts the connection periodically). Second, traffic originating from the box itself cannot be load balanced across multiple connections due to a limitation in FreeBSD’s implementation of pf: it is unable to set the correct source address.

It’s quite annoying that even when you use the forwarding mode of Unbound, your DNS still goes through a single WAN interface. Moreover, Unbound doesn’t seem to do parallel querying across DNS servers, so if you have listed multiple DNS servers as forwarders, it will try them one by one as they fail. Suppose the WAN interface carrying the DNS traffic is running at full capacity because of a download or somebody streaming a video. Your browsing then becomes slow as well, even though the browsing itself may go through another WAN connection. Notably, a stable multi-WAN setup in pfSense requires forwarding mode. In my experience the gateway switching for the box itself doesn’t work reliably, and I’ve had to face “host not found” error messages even when one of the connections was up.

Solution: use both the DNS Forwarder (DNSMasq) and the DNS Resolver together. Why not just the forwarder? Because Unbound is a better DNS server in terms of performance and security. And DNSMasq has an excellent feature that’s probably not available in any other DNS forwarder or resolver: the ability to send parallel queries to the upstream servers. It’s called the all-servers option. When DNSMasq receives a query for domain X, it sends the query to all the servers and returns the first response it receives, so you get the response extremely fast.

Basically, run the forwarder on a different port, listening on localhost, and let it do the actual forwarding to your DNS servers instead of Unbound talking to your upstream DNS servers directly. You may argue that the performance benefits of Unbound get negated by using the DNS Forwarder as its upstream. Probably yes, but probably no too, because of the parallel queries. Moreover, DNSMasq is used exclusively for relaying queries between Unbound and the upstream servers and nothing else, not even caching, so that should remove any bottlenecks in there. Personally, I’ve used DNSMasq as the only server in the past and it was pretty fast in terms of user experience. I haven’t compared benchmarks and I’m not interested in doing that either.

So go to Services => DNS Forwarder in pfSense and configure it with the following settings:

[Screenshot: DNS Forwarder settings]

Then in the custom options box, put the following:

no-negcache
cache-size=0
resolv-file=/var/run/resolv.conf

You may be wondering why I’ve specified a different resolv.conf file here. It’s to prevent DNSMasq from using Unbound as a DNS server. That can also be achieved by tweaking the “Disable DNS Forwarder” flag in System => General Setup, but that would make the box’s own DNS slower. I want the pfSense box to use Unbound too, but DNSMasq should not use Unbound.

Then in DNS Resolver settings:

Disable Forwarding Mode, otherwise pfSense will put the configured upstream DNS servers into Unbound’s configuration.

In the custom options box:

server:
  do-not-query-localhost: no
forward-zone:
  name: "."
  forward-addr: 127.0.0.1@5353

Further, to prevent Unbound from restarting whenever your WAN connection restarts:

  • SSH to your pfSense box
  • Open /etc/rc.newwanip in your favorite editor and go to line 228, which calls the function services_unbound_configure(). Comment out that line.
  • Then open /etc/inc/system.inc, go to line 175, which opens /etc/resolv.conf for writing, and replace it with:
$fd = fopen("/var/run/resolv.conf", "w");

Then in System => General Setup, tick the box which says “Do not use the DNS Forwarder/DNS Resolver as a DNS server for the firewall”. Now pfSense will write the real resolv.conf, based on your configured DNS servers or whatever your ISP supplies (configurable in General Setup), to /var/run/resolv.conf. In /etc/resolv.conf just put one line:

nameserver 127.0.0.1

This ensures that pfSense uses Unbound as its resolver, so it is immune to WAN failures.

So now, the DNS works something like this:

Client => Unbound => DNSMasq => Upstream DNS Server

Here, the client can be any client on the LAN or pfSense itself. Because of the all-servers feature of DNSMasq, both WAN connections get used for DNS traffic, and choking one line will not cripple DNS.
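
To illustrate why the all-servers behaviour helps, here is a rough sketch in Go of the "ask everyone, take the first answer" idea. This is not DNSMasq’s code: the resolver type and the fakeServer and brokenServer helpers are hypothetical stand-ins for real upstream DNS servers, and the delays only simulate a congested and an idle WAN link.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// resolver stands in for one upstream DNS server: it takes a name and
// returns an answer or an error. Real code would send a DNS query here.
type resolver func(name string) (string, error)

// firstAnswer sends the query to every resolver at once and returns the
// first successful answer. It fails only if every resolver fails.
func firstAnswer(name string, servers []resolver) (string, error) {
	type result struct {
		answer string
		err    error
	}
	// Buffered so slower resolvers can still send after we have returned.
	results := make(chan result, len(servers))
	for _, s := range servers {
		go func(s resolver) {
			answer, err := s(name)
			results <- result{answer, err}
		}(s)
	}

	lastErr := errors.New("no upstream servers configured")
	for range servers {
		r := <-results
		if r.err == nil {
			return r.answer, nil
		}
		lastErr = r.err
	}
	return "", lastErr
}

// fakeServer simulates an upstream that answers after the given delay.
func fakeServer(answer string, delay time.Duration) resolver {
	return func(string) (string, error) {
		time.Sleep(delay)
		return answer, nil
	}
}

// brokenServer simulates an upstream that cannot be reached.
func brokenServer() resolver {
	return func(string) (string, error) {
		return "", errors.New("upstream unreachable")
	}
}

func main() {
	servers := []resolver{
		fakeServer("answer via WAN1", 200*time.Millisecond), // congested link
		fakeServer("answer via WAN2", 10*time.Millisecond),  // idle link
		brokenServer(),
	}
	answer, err := firstAnswer("example.com", servers)
	fmt.Println(answer, err)
}
```

With these simulated delays the answer should come from the idle link, and the broken upstream is simply ignored, which is the behaviour I wanted from the DNS path.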

A WAN monitor running on Google AppEngine written in Go language

As I stated in my earlier post, I have two WAN connections and, of course, there’s a need to monitor them. The monitoring logic is pretty simple: it sends me a message on Telegram every time there’s a state change, UP or DOWN.

Initially this monitoring logic was built as an OpenWrt hotplug script which triggered on interface UP / DOWN events, as described in this article. But then I got a mini PC box, and it runs Ubuntu and a pfSense virtual machine. I could build the same logic by discovering hooks in the pfSense code, but that is too complex, and moreover it doesn’t really make sense to monitor the connections of a device using those same connections!

Perfect, time for a new small project. I was trying to learn the Go language, and what better way to learn a new programming language than solving a problem? I built my solution in Go on Google AppEngine.

Why AppEngine? Well, yes, I could use any random monitoring service out there, but I doubt any such service sends alerts on Telegram. Also, AppEngine is included in the Google Cloud Free Tier, so it makes a lot of sense here. My monitoring program runs off Google’s epic infrastructure and I don’t have to pay anything for it!

If you’ve looked at Go examples, it’s pretty easy to spin up a web server. AppEngine makes running your own Go based app even easier, though with a few restrictions, which are documented nicely by Google in their docs. The restrictions are mostly about outgoing connections and file modifications. I don’t need to read or write any files, but I do need to make outgoing connections, for which I used the AppEngine socket and urlfetch libraries.

An AppEngine app always includes a file named app.yaml, which describes the runtime and the URL endpoints. Here’s mine:

runtime: go
api_version: go1.8

handlers:
  - url: /checkisps
    script: _go_app
    login: admin
  - url: /
    script: _go_app

Now the main code which will handle the requests:

package main

import (
	"fmt"
	"ispinfo"
	"net/http"
	"time"

	"google.golang.org/appengine"
)

var isps = [2]ispinfo.ISPInfo{
	ispinfo.ISPInfo{
		Name:       "ISP1",
		IPAddress:  "x.x.x.x",
		PortNumber: 80,
	},
	ispinfo.ISPInfo{
		Name:       "ISP2",
		IPAddress:  "y.y.y.y",
		PortNumber: 443,
	},
}

func main() {
	http.HandleFunc("/", handle)
	http.HandleFunc("/checkisps", checkisps)

	for index := range isps {
		isps[index].State = false
		isps[index].LastCheck = time.Now()
	}

	appengine.Main()
}

func handle(w http.ResponseWriter, r *http.Request) {
	location := time.FixedZone("IST", 19800)
	w.Header().Add("Content-Type", "text/plain")

	for _, isp := range isps {
		fmt.Fprintln(w, "ISP", isp.Name, "is", isp.Status(), ". Last Checked at", isp.LastCheck.In(location).Format(time.RFC822))
	}
}

func checkisps(w http.ResponseWriter, r *http.Request) {
	for idx := range isps {
		oldstate := isps[idx].State
		isps[idx].Check(r)

		if oldstate != isps[idx].State {
			defer isps[idx].SendAlert(r, w)
		}
	}
}

I separated the code into two packages to keep it clean, so here’s the ispinfo package:

package ispinfo

import (
	"fmt"
	"net/http"
	"net/url"
	"time"

	"google.golang.org/appengine"
	"google.golang.org/appengine/socket"
	"google.golang.org/appengine/urlfetch"
)

type ISPInfo struct {
	Name       string
	IPAddress  string
	PortNumber uint16
	State      bool
	LastCheck  time.Time
}

// Package-level variables cannot be declared with :=, so use plain var.
var telegram_bot_token = "<<< bot token >>>"
var telegram_chat_id = "<<< chat id >>>"

func (i *ISPInfo) Check(r *http.Request) {
	ctx := appengine.NewContext(r)

	host := fmt.Sprintf("%s:%d", i.IPAddress, i.PortNumber)
	conn, err := socket.DialTimeout(ctx, "tcp", host, 5*time.Second)
	i.LastCheck = time.Now()

	if err != nil {
		// conn is nil when the dial fails, so there is nothing to close.
		i.State = false
		return
	}

	i.State = true
	conn.Close()
}

func (i *ISPInfo) SendAlert(r *http.Request, w http.ResponseWriter) {
	ctx := appengine.NewContext(r)
	client := urlfetch.Client(ctx)

	message := fmt.Sprintf("%s is %s", i.Name, i.Status())
	params := url.Values{}
	url := fmt.Sprintf("https://api.telegram.org/bot%s/sendMessage", telegram_bot_token)

	params.Add("chat_id", telegram_chat_id)
	params.Add("text", message)

	response, err := client.PostForm(url, params)

	if err != nil {
		fmt.Fprintln(w, "Error sending message ", err.Error())
		return
	}

	response.Body.Close()
}

func (i *ISPInfo) Status() string {
	if i.State {
		return "UP"
	}

	return "DOWN"
}

Since the connection status needs to be monitored periodically, define a cron job for it in cron.yaml:

cron:
  - description: "check connection status"
    url: /checkisps
    schedule: every 5 mins
 

Run gcloud app deploy app.yaml cron.yaml in the app directory and the app is ready!

This is a small monitoring service that I managed to build in a couple of hours while learning the Go language and the AppEngine API. It should hardly take an hour for a pro. Also, I didn’t really follow correct packaging principles: the ispinfo package exposes pretty much all of its fields. This could have been better.

The code is available in my github repository in case you’re interested in it.

Monitoring your internet connections with OpenWRT and a Telegram Bot

For the past 5 years or so, I have been using a single ISP at home, with mobile data as backup when it went down. But for the last few months the ISP’s service has been a bit unreliable, mostly because of the rainy season. Mobile data doesn’t give the constant, fiber-like speeds I get on the wire. It’s very annoying to browse at < 10 Mbps on mobile data when you are used to 100 Mbps on the wire.

I decided to get another fiber pipe from a local ISP. One needs to be very unlucky to have both go down at the same time, and I hope that never happens. Now the question is how to monitor the two connections. Why do I need monitoring? So that I can inform the ISP when its link goes down. With fail-over happening automatically thanks to OpenWRT’s mwan3 package, I would otherwise never know which ISP I am using (unless I check the public IP address, of course).

The solution: a custom API and a Telegram bot. For those not aware of Telegram, it is an amazing messaging app just like WhatsApp with way more features (bots, channels), and it does away with some idiosyncrasies of WhatsApp, such as requiring the phone to always be connected.

A Telegram bot is fairly simple to write: you just have to use their API. This bot only sends me messages and I never send any to it, so implementing my WAN monitor bot was very easy.

My router is a TP-Link WR740N with 4 MB of flash, so it is not possible to have curl with the SSL support that the API requires. I wrote a custom script which can be called over plain HTTP and plays well with the default wget. The script lives on a cloud server which can, obviously, do the SSL stuff.

A custom PHP wrapper around the Telegram API to send the message:

<?php

$key = '<a random key>';

if ($_REQUEST['key'] != $key) {
    die("Access Denied");
}

$interface = $_REQUEST['interface'];
$status = $_REQUEST['status'];

$interface_map = array(
    'wan1' => 'ISP1',
    'wan2' => 'ISP2'
);

$status_map = array(
    'ifdown' => 'Down',
    'ifup' => 'Up'
);

// Use {$...} interpolation; the ${...} form is deprecated as of PHP 8.2.
$message = "TRIGGER: {$interface_map[$interface]} is {$status_map[$status]}";

$ch = curl_init("https://api.telegram.org/bot<bot ID>/sendMessage");

curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, array(
    'chat_id' => '<your chat id>',
    'text' => $message
));
curl_exec($ch);

The <your chat id> part needs to be discovered once: send a /start command to your bot, then call Telegram’s getUpdates method. You will find the chat id in the API’s JSON response. $key is just a security check to prevent outsiders from calling the script.
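
Picking the chat id out of the getUpdates response by eye works fine, but it can also be scripted. Below is a small sketch in Go, assuming the usual response shape of an ok flag plus a result list whose entries carry message.chat.id. The chatIDs function and the sample JSON are made up for this illustration, and fetching the response from the API is left out.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// updates mirrors only the fields of a getUpdates response needed here.
type updates struct {
	OK     bool `json:"ok"`
	Result []struct {
		Message struct {
			Chat struct {
				ID int64 `json:"id"`
			} `json:"chat"`
		} `json:"message"`
	} `json:"result"`
}

// chatIDs returns the distinct chat ids found in a getUpdates response body.
func chatIDs(body []byte) ([]int64, error) {
	var u updates
	if err := json.Unmarshal(body, &u); err != nil {
		return nil, err
	}
	if !u.OK {
		return nil, errors.New("telegram API reported ok=false")
	}

	seen := map[int64]bool{}
	var ids []int64
	for _, r := range u.Result {
		id := r.Message.Chat.ID
		// Updates without a message decode to id 0 and are skipped.
		if id != 0 && !seen[id] {
			seen[id] = true
			ids = append(ids, id)
		}
	}
	return ids, nil
}

func main() {
	// Trimmed-down, made-up sample of a response after sending /start.
	sample := []byte(`{"ok":true,"result":[{"update_id":1,"message":{"message_id":1,"chat":{"id":123456789,"type":"private"},"text":"/start"}}]}`)
	ids, err := chatIDs(sample)
	fmt.Println(ids, err)
}
```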

And this script is called on interface events by mwan3 (/etc/mwan3.user):

wget -O /dev/null "http://<server ip address>/wanupdate.php?interface=$INTERFACE&status=$ACTION&key=<your random key>" >/dev/null 2>/dev/null &

A shell script, run by cron directly on the server, to monitor the connections:

#!/bin/bash

nc -w 2 -z <ip address> <port number>
isp1_status=$?

nc -w 2 -z <ip address> <port number>
isp2_status=$?

sendmsg() {
    curl https://api.telegram.org/bot<bot id>/sendMessage -d chat_id=<chat id> -d text="$1" &> /dev/null
}

if [[ $isp1_status -ne 0 ]]; then
    sendmsg "MONITOR: ISP1 is Down"
fi

if [[ $isp2_status -ne 0 ]]; then
    sendmsg "MONITOR: ISP2 is Down"
fi

The above script uses netcat to do the link test by opening a TCP connection to a port which is forwarded to a server. I went with this because ping was giving false positives: I couldn’t reproduce it when trying manually, but I used to get DOWN messages even though the connection was working.

One might wonder how the message will reach me via Telegram when both ISPs go down at the same time. Well, I leave that job to Android and mobile data: Android switches to mobile data as soon as it finds that WiFi doesn’t have internet access.

Asterisk PJSIP wizard and phone provisioning

So after setting up Asterisk with a working DAHDI configuration for the PBX project, next was configuration for IP phones using PJSIP and provisioning them.

Asterisk has a built-in module called res_phoneprov which handles HTTP based phone provisioning but that didn’t work for me – I just couldn’t have it generate XML configuration for the phones that we had, i.e. Grandstream GXP1625.

The server on which I had configured PBX was multi-homed, as in it was part of multiple networks. But there was no reason to run the service on all interfaces except the VLAN on which we were going to connect the phones.


Asterisk PBX with Reliance PRI Line using Digium TE131F

So I got an opportunity to set up an Asterisk PBX with a Reliance Communications E1 line. I have worked with Asterisk PBX before, but without PSTN interfacing. This post is about everything I did to get a Reliance E1 line working with a Digium TE131F card.

I have explored a lot of distributions like Fedora, Arch, Gentoo, Sabayon, etc. since I ventured into the Linux world, learning the internals of Linux and how its components are stitched together, and I have settled on Ubuntu. It’s my favorite these days because everything seems to work out of the box… except when it doesn’t, and then you have PPAs. 😛 For this project I installed Ubuntu 16.04 server edition.

The default Asterisk 13.13 build which is available in the Ubuntu Xenial repository is broken. If you enable the PJSIP channel (res_pjsip.so), Asterisk will crash immediately after starting up. There is even a launchpad bug for the same => Asterisk crashes with default install because of pjsip. Thankfully arpagon (Sebastian) has fixed it in his PPA which is given in the last comment of the bug. The PPA is this one => Asterisk PPA : “Sapian” team.

So add the PPA (but do not install asterisk yet):

sudo add-apt-repository ppa:sapian/asterisk
sudo apt update

DAHDI Driver Installation

First of all, we need to install the DAHDI drivers. The Digium TE131F driver is absent with the default install (launchpad bug) and needs extra steps so that the module gets installed, as given below.

Install the dahdi-dkms package and then remove all the drivers that were installed into the current kernel along with it:

apt install dahdi-dkms
dkms uninstall dahdi/2.10.2~dfsg-1ubuntu1
dkms remove dahdi/2.10.2~dfsg-1ubuntu1 --all

Then go to /usr/src/dahdi-2.10.2~dfsg-1ubuntu1 and edit the dkms.conf file to add these lines at the end, just before the AUTOINSTALL=yes line. These should be the last few lines of the file:

BUILT_MODULE_NAME[33]="oct612x"
BUILT_MODULE_LOCATION[33]="drivers/dahdi/oct612x/"
DEST_MODULE_LOCATION[33]="/kernel/drivers/telephony/dahdi"

BUILT_MODULE_NAME[32]="wcte13xp"
BUILT_MODULE_LOCATION[32]="drivers/dahdi/"
DEST_MODULE_LOCATION[32]="/kernel/drivers/telephony/dahdi"

AUTOINSTALL=yes

Now we can proceed to install the DKMS drivers for our current kernel, and Asterisk as well:

dkms add dahdi/2.10.2~dfsg-1ubuntu1
dkms install dahdi/2.10.2~dfsg-1ubuntu1
apt install asterisk asterisk-dahdi

DAHDI Configuration

Edit /etc/dahdi/modules and delete the line that says dahdi_dummy. In the same file, add a line wcte13xp. This should be the content of /etc/dahdi/modules:

dahdi
dahdi_transcode
wcte13xp

Create a modprobe configuration file named /etc/modprobe.d/pri.conf with the content below:

options dahdi auto_assign_spans=1
options wcte13xp default_linemode=e1

Actually, the modprobe configuration file can be avoided if you want to configure the DAHDI cards manually, and it is useful to do so only if you have more than one card in the system. In my case there is just one, so I went this way.

Then load the modules for the card:

modprobe wcte13xp
modprobe dahdi_transcode

Run dmesg to see if there were any errors. When I did this, I got a firmware error saying the card firmware was older than the version required by the driver. To fix that, I downloaded the required TE133 firmware file (the same version as demanded by the driver, given in the dmesg output) from Digium and placed it into /lib/firmware, then unloaded the wcte13xp module and reloaded it. After that, the driver starts flashing the new firmware, and once the process is complete the card is ready for use.

Now we are ready to configure DAHDI for use by Asterisk. First we generate the configuration used by the DAHDI Linux Tools:

dahdi_genconf --line-mode=E1 -v

Edit /etc/dahdi/system.conf and change the variables loadzone and defaultzone to your country. In my case it is India, so:

# Autogenerated by /usr/sbin/dahdi_genconf on Mon Jul 10 08:43:20 2017
# If you edit this file and execute /usr/sbin/dahdi_genconf again,
# your manual changes will be LOST.
# Dahdi Configuration File
#
# This file is parsed by the Dahdi Configurator, dahdi_cfg
#
# Span 1: WCT13x/0 "Wildcard TE131/TE133 Card 0" (MASTER) CCS/HDB3/CRC4 RED ClockSource 
span=1,1,0,ccs,hdb3,crc4
# termtype: te
bchan=1-15,17-31
dchan=16
echocanceller=oslec,1-15,17-31

# Global data

loadzone	= in
defaultzone	= in

Before jumping to the Asterisk configuration, let’s first check that the DAHDI configuration works correctly. Read the output of the command below carefully and ensure there are no errors. When I did this, I got an echo cancellation related error, apparently due to a kernel upgrade done by aptitude. I had to redo the driver installation step after rebooting to fix that.

dahdi_cfg -vvv

Asterisk Configuration

There are a lot of parameters in the DAHDI channel configuration, and we do not really care what they are at this moment; we just want to get the thing working. So just edit the DAHDI channel configuration file, go to the end of the file, and add the following line:

#include /etc/asterisk/dahdi-channels.conf

Let’s build a sample extensions.conf that plays Hello World when called. I prefer to configure Asterisk using Lua because I’m a programmer and it’s easier for me to do that than to learn a new way to configure something. So this configuration is limited to just playing Hello World for testing purposes:

[from-pstn]
exten = _X.,1,Answer()
same = n,Wait(3)
same = n,Playback(hello-world)
same = n,Hangup()

Finally, to ensure that DAHDI is configured properly before Asterisk starts, we need to add an override to the Asterisk SystemD unit file. Run systemctl edit asterisk and add the lines below:

[Service]
ExecStartPre=/usr/sbin/dahdi_cfg

Additionally, we also need to load the DAHDI and our card modules at boot, so:

# systemd-modules-load only reads files ending in .conf
ln -s /etc/dahdi/modules /etc/modules-load.d/dahdi.conf

The above two steps are required because the dahdi-linux package does not provide the DAHDI init script which does these things. I have filed a bug for the same, at launchpad.

Restart Asterisk and call your allocated DID and you should hear Hello World from Asterisk!

ZFS convert stripe to striped-mirror


I’m a huge fan of ZFS because of its performance and other features like snapshots and transparent compression. In fact, I had switched to FreeBSD for servers just because it had native ZFS support. But as of Ubuntu 16.04, ZFS is officially supported for non-root partitions.

Now I’m migrating a FreeBSD server to Ubuntu 16.04 with ZFS for data storage – this is happening because I need support for some special hardware which has drivers only for Linux and I do not have a spare server machine of same capacity in terms of memory/disk/processor.

My case –
Here’s the zpool layout on my existing FreeBSD server:

        zroot              ONLINE       0     0     0
          mirror-0         ONLINE       0     0     0
            diskid/DISK-1  ONLINE       0     0     0
            diskid/DISK-2  ONLINE       0     0     0
          mirror-1         ONLINE       0     0     0
            diskid/DISK-3  ONLINE       0     0     0
            diskid/DISK-4  ONLINE       0     0     0  

Each of those disks is 1TB in size, and the layout here is known as RAID 10, or striped mirroring. Striped mirroring can be extended to more than four disks, but in my case I have two pairs of disks. Each pair is mirrored and the mirrors are striped together, as illustrated in the image below:

Image taken from techtarget.com, their trademark/copyright holds.

The advantage of this layout is that you get the read speed of four disks, the write speed of two disks and a failure tolerance of two disks (as long as they are in different mirrors), all at the same time.
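
Those numbers can be worked out with a few lines of Go. This is only an idealized sketch: the stripedMirror type and its methods are made up for this illustration, and real throughput depends on the disks and the workload, so treat the "disks" figures as rough multipliers rather than measurements.

```go
package main

import "fmt"

// stripedMirror describes a pool of identical mirrors striped together.
type stripedMirror struct {
	Mirrors        int     // number of mirror vdevs in the stripe
	DisksPerMirror int     // disks in each mirror
	DiskTB         float64 // size of one disk in TB
}

// UsableTB: each mirror stores one disk's worth of data and the stripe
// adds the mirrors up.
func (p stripedMirror) UsableTB() float64 {
	return float64(p.Mirrors) * p.DiskTB
}

// ReadDisks: a read can be served by any disk, so all disks contribute.
func (p stripedMirror) ReadDisks() int {
	return p.Mirrors * p.DisksPerMirror
}

// WriteDisks: a write must land on every disk of a mirror, so only the
// number of mirrors adds write speed.
func (p stripedMirror) WriteDisks() int {
	return p.Mirrors
}

// Survives reports whether the pool stays online when failed[i] disks are
// lost in mirror i: every mirror must keep at least one working disk.
func (p stripedMirror) Survives(failed []int) bool {
	for _, f := range failed {
		if f >= p.DisksPerMirror {
			return false
		}
	}
	return true
}

func main() {
	pool := stripedMirror{Mirrors: 2, DisksPerMirror: 2, DiskTB: 1}
	fmt.Println("usable TB:", pool.UsableTB())
	fmt.Println("read disks:", pool.ReadDisks())
	fmt.Println("write disks:", pool.WriteDisks())
	fmt.Println("one failure per mirror:", pool.Survives([]int{1, 1}))
	fmt.Println("two failures in one mirror:", pool.Survives([]int{2, 0}))
}
```

For my four 1TB disks this gives 2TB of usable space, and it also shows why the two tolerated failures must be in different mirrors: losing both disks of the same mirror takes the whole pool down.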

I have a spare 1TB disk which I can use for preparing a new server using a low-end machine for migration. I remove one of the disks from the live server so the pool there runs in a degraded state. The removed disk is used in the new server. So I create this zpool in Ubuntu:

zpool create zdata /dev/disk/by-id/disk1-part3 /dev/disk/by-id/disk2-part3
zpool status

        zdata          ONLINE       0     0     0
          disk1-part3  ONLINE       0     0     0
          disk2-part3  ONLINE       0     0     0

The pool created here is a plain simple stripe. To convert this into a striped-mirror, the zpool attach command has to be used:

zpool attach zdata disk1-part3 disk3-part3
zpool attach zdata disk2-part3 disk4-part3

With this, the pool now becomes a striped mirror:

	zdata            ONLINE       0     0     0
	  mirror-0       ONLINE       0     0     0
	    disk1-part3  ONLINE       0     0     0
	    disk3-part3  ONLINE       0     0     0
	  mirror-1       ONLINE       0     0     0
	    disk2-part3  ONLINE       0     0     0
	    disk4-part3  ONLINE       0     0     0

Perfect! 😀

SystemD FastCGI multiple processes

Of late, many mainstream distributions have been switching to SystemD as their init system. This includes Debian (since Debian 8) and Ubuntu (since Ubuntu 15.04). In the traditional SysV init system we used to have stuff like spawn-fcgi or custom scripts for starting a FastCGI process and having the web server connect to it over Unix or TCP sockets. Such kind of usage decreased when PHP FPM was introduced since it’s safe enough to assume that 90% (probably more) of the FastCGI deployments are just launching PHP interpreters using whatever mechanism is there (spawn-fcgi or custom scripts). PHP FPM does this for you now and it’s pretty good at it.

FastCGI is just a protocol, it can be used by any application. For custom applications which do not support starting their own FastCGI processes and listening on a socket we have to use external mechanisms. SystemD has a couple of good features which can help reduce the amount of custom work needed in terms of process monitoring, socket paths, file ownership, etc.

My use case:

So I installed the support ticketing system called Request Tracker from Best Practical on a server, initially using spawn-fcgi to start the FastCGI process and nginx to serve the site. I found problems quite fast: spawn-fcgi creates N processes listening on the given socket, but there’s no re-launching mechanism if the processes die. Applications can crash any time, so we need something to relaunch them. The traditional options would have been a cron job to monitor the PID file, or monit; there are many options. Then I came across this article by the spawn-fcgi people about how to use SystemD to start a FastCGI process.

Once again, a custom script is involved. Digging through SystemD documentation and some more Googling, I was able to get the FastCGI process spawning working using SystemD without any external dependencies (no extra scripts, etc). Here’s how:

# rt-fastcgi.service
[Unit]
Description = Request Tracker FastCGI backend
After = postgresql.service
Wants = postgresql.service

[Service]
User = rt
Group = rt
ExecStart = /usr/local/rt/sbin/rt-server.fcgi
StandardOutput = null
StandardInput = socket
StandardError = null
Restart = always

[Install]
WantedBy = multi-user.target

# rt-fastcgi.socket
[Unit]
Description = RT FastCGI Socket

[Socket]
SocketUser = www-data
SocketGroup = www-data
SocketMode = 0660
ListenStream = /run/rt.sock

[Install]
WantedBy = sockets.target

If you run systemctl enable rt-fastcgi.socket && systemctl start rt-fastcgi.socket, the RT process should be started by SystemD when the first request arrives from the web server at /run/rt.sock, and the process keeps running, listening for further requests. It gets restarted automatically if it crashes or is sent a SIGTERM manually (which is needed if you make changes to the configuration file).

The problem with this setup:

If you compare this method with the spawn-fcgi method, you will observe that spawn-fcgi can spawn multiple request handlers (i.e. multiple RT processes which can serve the web server) using a single command, but the same is not possible with this SystemD method: the daemon being spawned by SystemD must do this extra handling itself. A single process cannot serve many concurrent requests, which is definitely a problem. So let’s use SystemD’s template units, which create one unit instance per instance name, to run multiple processes:

# rt-fastcgi@.service
[Unit]
Description = Request Tracker FastCGI backend (instance %i)
After = postgresql.service
Wants = postgresql.service

[Service]
User = rt
Group = rt
ExecStart = /usr/local/rt/sbin/rt-server.fcgi
StandardOutput = null
StandardInput = socket
StandardError = null
Restart = always

[Install]
WantedBy = multi-user.target

# rt-fastcgi@.socket
[Unit]
Description = RT FastCGI Socket (instance %i)

[Socket]
SocketUser = www-data
SocketGroup = www-data
SocketMode = 0660
ListenStream = /run/rt%i.sock

[Install]
WantedBy = sockets.target

What the above setup does is listen for requests on multiple sockets at /run/rt%i.sock and spawn the appropriate number of instances. How do we use this from nginx? Here’s how:

upstream rt_backend {
    server unix:/run/rt1.sock;
    server unix:/run/rt2.sock;
    server unix:/run/rt3.sock;
    server unix:/run/rt4.sock;
    server unix:/run/rt5.sock;
}

server {
# other stuff whatever needed

    location / {
        include fastcgi_params;
        fastcgi_param SCRIPT_NAME "";
        fastcgi_param PATH_INFO $uri;
        #fastcgi_pass unix:/run/rt.sock;
        fastcgi_pass rt_backend;
    }
}

Then just run systemctl enable rt-fastcgi@{1..N}.socket && systemctl start rt-fastcgi@{1..N}.socket (with N replaced by the number of instances), start firing requests from nginx, and you should see the number of processes grow, because nginx sends requests to the upstream servers in round-robin fashion (which can be changed, of course; refer to the nginx documentation for that).
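
The socket unit names and the nginx upstream entries have to agree on the instance numbers, so it can help to generate both from the same N. Here is a small sketch in Go that does just that; the socketUnits and upstreamBlock helpers are made up for this illustration and only print text, they do not call systemctl or touch the nginx configuration.

```go
package main

import (
	"fmt"
	"strings"
)

// socketUnits returns the SystemD socket unit names for instances 1..n.
func socketUnits(n int) []string {
	units := make([]string, 0, n)
	for i := 1; i <= n; i++ {
		units = append(units, fmt.Sprintf("rt-fastcgi@%d.socket", i))
	}
	return units
}

// upstreamBlock renders the matching nginx upstream block, one server
// line per socket at /run/rt<i>.sock.
func upstreamBlock(n int) string {
	var b strings.Builder
	b.WriteString("upstream rt_backend {\n")
	for i := 1; i <= n; i++ {
		fmt.Fprintf(&b, "    server unix:/run/rt%d.sock;\n", i)
	}
	b.WriteString("}\n")
	return b.String()
}

func main() {
	n := 5 // number of RT instances, same as in the nginx example above
	units := strings.Join(socketUnits(n), " ")
	fmt.Println("systemctl enable " + units)
	fmt.Println("systemctl start " + units)
	fmt.Print(upstreamBlock(n))
}
```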