How Not To Selectively epoll(7) — or — Multiplexing Network Socket File Descriptors When There Ain't Nothin' There To Read Anyhow

I've been writing some internal APIs to abstract away the subtleties of managing multiple network peers behind socket file descriptors. The ideal this mini-library aims for is that client programs can connect up to lots of network endpoints, and then publish lots of little binary messages to all endpoints simultaneously without peering into the eldritch horror that is the Real World of Networked Computers.

The guts of the implementation center around a struct socket which the client program drives via library functions — connect to this host and that host, send a message to both, wait for a message from either. Each connection has its own socket file descriptor, and it's the job of my library thing to multiplex them properly.

Easy. epoll(7) to the rescue!

It looks something like this (guaranteed to not actually compile). First, the data structures:

struct peer {
  int   fd;      /* socket file descriptor                */
  int   epfd;    /* epoll(7) management descriptor        */
  void *tx;      /* some sort of transmission queue       */

  struct peer *next;

struct socket {
  struct peer *peers;

The tx member of struct peer is a magic, hand-wavy queue. Don't fret too much about it; but trust me, it keeps an ordered list of the messages that still need to be written to this peer.

That's where sendit() comes in:

void sendit(struct socket *sock, void *msg) {
  struct peer *p;

  /* don't write(2) anything yet; just enqueue for tx */
  for (p = sock->peers; p; p = p->next) {
    p->tx = enqueue(p->tx, msg);

This is a PUBLISH model of communication; all of the connected endpoints will receive a copy of each message we send. No round-robin. No LRU.

To do the heavy lifting of the actual send / receive, we turn to pollit:

void pollit(struct socket *sock) {
  int nfd;
  struct peer *p;
  struct epoll_event ev[MAX_EVENTS];

  nfd = epoll_wait(sock->epfd, ev, MAX_EVENTS, -1);
  for (i = 0; i < nfd; i++) {
    for (p = sock->peers; p; p = p->next) {
      if (ev[i].data.fd != p->fd) continue;

      if (ev[i].events & EPOLLIN)
        recvit(sock, p);

      if (ev[i].events & EPOLLOUT
          && p->tx)                   /* my FATAL MISTAKE! */
        sendit(sock, p);

My fatal mistake was in adding that && p->tx compound conditional to the EPOLLOUT check. I started seeing pollit() called multiple times a second, and not actually blocking as it waited for something worthwhile to do. A CPU busy-loop, and my battery was not thrilled. (Fans were pretty excited about it though)

To understand why, I had to take a step back and think long and hard about what it means for a network socket to be readable or writable. At first blush, it seems straightforward - can I send a packet to the other side?

Except it's a little more complicated than that, isn't it?

It may surprise you to find out, but userspace applications literally have no way of knowing when the packets are sent. That all happens inside the kernel, and the kernel is very territorial. When you write(2) to a network socket, the "packet" is just sitting in a buffer somewhere inside the kernel, right next to the network card. It may get sent right away, it may wait for Nagle's Algorithm to run its course. It may wait for the receiving end to open up its TCP window.

When epoll_wait checks to see if the socket file descriptor is writable, it's really checking to see how much space is available in that kernel buffer. If the kernel has the room, for anything (even just a single octet!) epoll will flag it as EPOLLOUT.

epoll doesn't care about my && p->tx.

To get the desired behavior, I ended up having to perform a delicate, if commonsensical dance with epoll — when a message is queued up, tell epoll to watch that peer's socket descriptor for writability. Once it is sent, and the queue is empty, tell epoll to knock it off.

Happy Hacking!

James (@iamjameshunt) works on the Internet, spends his weekends developing new and interesting bits of software and his nights trying to make sense of research papers.

Currently working on Bolo.