Twemproxy Twouble

Introduction

We’ve been running a cluster of Redis instances behind Twemproxy for some time now, reaping the benefits of the massive throughput this affords. Recently, we began to scale up our use of the cluster at an increased rate, whereupon we started to experience errors. One of the heavy users of the cluster is a Go process with goroutines numbering in the thousands.

Despite the title, this is more a tale of me discovering things about Systemd, with regard to both diagnostics and configuration.

The Setup

Redis

To create the Twemproxy-fronted cluster, we used 16 Redis instances. Redis was already installed on the machine (CentOS 7.x), so we were able to copy the stock configuration and service unit files and modify them for each instance. What follows is an example for one of the instances.

Create a data directory and ensure it is owned by redis.

$ mkdir -p /var/lib/redis-b01
$ chown redis:redis /var/lib/redis-b01

Copy the configuration.

$ cp /etc/redis.conf /etc/redis-b01.conf

Summary of configuration changes.

...
pidfile /var/run/redis/redis-b01.pid
...
port 6380
...
# bind 127.0.0.1
...
unixsocket /tmp/redis-b01.sock
unixsocketperm 760
...
logfile /var/log/redis/redis-b01.log
...
dir /var/lib/redis-b01/
...
requirepass some-strong-password
...

Copy the service unit file.

$ cp /usr/lib/systemd/system/redis.service /usr/lib/systemd/system/redis-b01.service

Changes.

ExecStart=/usr/bin/redis-server /etc/redis-b01.conf --daemonize no
ExecStop=/usr/bin/redis-cli -s /tmp/redis-b01.sock -a some-strong-password shutdown

After enabling and starting with Systemd, we had running Redis instances listening on their own ports and Unix sockets.
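
For reference, enabling and starting one of the instances, then sanity-checking it over its Unix socket, looked like this (using the redis-b01 instance from above).

$ systemctl daemon-reload
$ systemctl enable redis-b01
$ systemctl start redis-b01
$ redis-cli -s /tmp/redis-b01.sock -a some-strong-password ping
PONG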

Twemproxy

Twemproxy was built from source with the configuration recommended in the documentation on GitHub. It was installed to /usr/local/twemproxy/.
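
The build itself isn’t important to this story, but it was roughly the standard autotools dance from the README, along these lines (the --prefix is ours, matching the install location above).

$ git clone https://github.com/twitter/twemproxy.git
$ cd twemproxy
$ autoreconf -fvi
$ ./configure --prefix=/usr/local/twemproxy
$ make
$ make install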

We use more than one Twemproxy instance to access the shards, but here is an example of one of the configuration files, located at /etc/redis-twemproxy.yml.

alpha:
  listen: /tmp/redis-twemproxy.sock 0760
  hash: fnv1a_64
  distribution: ketama
  auto_eject_hosts: false
  backlog: 4096
  timeout: 2000
  redis: true
  redis_auth: some-strong-password
  servers:
   - 127.0.0.1:6380:1 b01
   - 127.0.0.1:6381:1 b02
   - 127.0.0.1:6382:1 b03
   - 127.0.0.1:6383:1 b04
   - 127.0.0.1:6384:1 b05
   - 127.0.0.1:6385:1 b06
   - 127.0.0.1:6386:1 b07
   - 127.0.0.1:6387:1 b08
   - 127.0.0.1:6388:1 b09
   - 127.0.0.1:6389:1 b10
   - 127.0.0.1:6390:1 b11
   - 127.0.0.1:6391:1 b12
   - 127.0.0.1:6392:1 b13
   - 127.0.0.1:6393:1 b14
   - 127.0.0.1:6394:1 b15
   - 127.0.0.1:6395:1 b16

There is another cluster of instances running as slaves of these, but that detail is not important for our purposes here.

We set up Twemproxy to be managed by Systemd. Our initial unit file looked like this.

[Unit]
Description=Twemproxy (Nutcracker) Redis Proxy.
After=network.target

[Service]
ExecStart=/usr/local/twemproxy/sbin/nutcracker -v 6 -m 4096 -s 22224 -c /etc/redis-twemproxy.yml
User=redis
Group=redis

[Install]
WantedBy=multi-user.target

After adding firewall rules and starting the service, we were in business.
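
As a quick check that a proxy is up, you can pull its stats from the stats port we set with “-s 22224”; Twemproxy serves them as a JSON document describing each pool and server. Something like the following (using nc) does the job.

$ nc 127.0.0.1 22224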

The Twouble

Over time, as we ramped up activity, our process began to have short outages. Many instances of the following messages were in the logs.

read unix @->/tmp/redis-twemproxy.sock: read: connection reset by peer
write unix @->/tmp/redis-twemproxy.sock: write: broken pipe

To check what was happening with Twemproxy we turned to Systemd’s logging compatriot, Journald. Using it, one can get the log output for a given service unit, starting from a certain date like this.

$ journalctl -u redis-twemproxy --since "2016-07-06 16:00:00"

All of the output from this command looked normal, but there were periods of 30 seconds without any log output. My investigations eventually led me to view the Journald output without limiting it to the service unit, whereupon I observed entries looking like this.

systemd-journal[879]: Suppressed 364370 messages from /system.slice/redis-twemproxy.service

Right, so it turns out that Journald has rate-limiting settings intended to prevent the logging system from being overloaded. Once a service emits more than RateLimitBurst messages within RateLimitInterval (30 seconds by default, which matches the gaps we had seen), Journald suppresses its further messages until the interval elapses. These settings can be changed in /etc/systemd/journald.conf. I initially made the following changes and restarted Journald.

RateLimitInterval=10s
RateLimitBurst=10000
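
Restarting Journald is just a matter of restarting its unit.

$ systemctl restart systemd-journald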

I was still getting log suppression, so I kept winding these limits out further. Eventually, in good ol’ Heisenbug style, I began to see Twemproxy log output containing a lot of messages like this.

nc_server.c:384 close s 17 schedule error for req 2109774 len 56 type 26 from c 11: Connection timed out

So what I had actually done was choke up Twemproxy by forcing Journald to explicitly handle all of its logging output. In retrospect this was a silly road to go down, and I reapplied the default Journald configuration.

Next I did what I should have done in the first place when I saw that Twemproxy’s log output was too fine-grained for Journald: lower the logging verbosity. You can see in the service unit definition above that we were initially using the “-v 6” option.
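
For illustration, the ExecStart line with a lower level such as “-v 4” looks like this; the exact level matters less than getting out of the per-request logging range while keeping errors visible.

ExecStart=/usr/local/twemproxy/sbin/nutcracker -v 4 -m 4096 -s 22224 -c /etc/redis-twemproxy.yml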

Now we were able to get the real word on our problem: the client connection limit was being exceeded, causing a torrent of error messages to be output.

nc_proxy.c:330 client connections 975 exceed limit 975

Wait, what? Where does this number come from? Here is a line from Twemproxy’s startup output.

nc_core.c:43 max fds 1024 max client conns 975 max server conns 17

Twemproxy derives the client connection cap from the file descriptor limit: it takes max fds (1024), subtracts the server connections and a small number of reserved descriptors, and what is left over (975) is available for clients. But the 1024 itself made no sense to me. We’re running a capable machine here: 40 logical cores and 128GB memory, with a massively increased file descriptor limit in /etc/security/limits.conf. Why was it not being honoured?

The Medicine

Some research turned up the fact that Systemd does not, by default, run services through a PAM session, so limits applied by pam_limits (from /etc/security/limits.conf) do not take effect for them. A Systemd service unit can be configured to use a PAM session, or it can be given its own file descriptor limit directly; I chose the latter.

This is the setting I added to the service unit file:

LimitNOFILE=32768
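
It goes in the [Service] section of the unit file (a drop-in override would work just as well). After a daemon-reload and a restart of the service, the new limit can be confirmed with systemctl.

$ systemctl daemon-reload
$ systemctl restart redis-twemproxy
$ systemctl show redis-twemproxy -p LimitNOFILE
LimitNOFILE=32768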

Since restarting Twemproxy with this change, things have been screaming along. So much so that week-on-week, we are processing record volumes of data with the system.