Open Files (Per Process)

Over the weekend, something bad happened on the network.

(That's the opening line of the scariest techno-thriller on-call personnel can think of.)

Anyway, these network goings-on caused an interesting situation for bolo, the metrics gathering and monitoring system. One of the subnets was able to open a connection to the bolo core, but unable to properly close it (or even, it is believed, to send any useful data). About 30 machines in this network happily spent the rest of the weekend opening socket after socket, and then leaking them.

After less than a day, this handful of machines was able to run the bolo core out of its hard ulimit on open file descriptors, about 65k. This is where the real fun started.

Other clients, in correctly routed networks, became unable to nail up connections to the core, and started stacking TCP connections in the SYN_SENT state. When enough of these were created, the packet loss began.

The operational fix, for those of you who are interested in that sort of thing, was to firewall off the misbehaving network and restart the bolo process to start over on file descriptors.

But that's not what this post is about — it's about detection.

On any other machine, or for any other process, a simple monitoring check (like the process collector in bolo-collectors) would be sufficient. Let the monitoring agent peer into /proc, tally the open files, and relay that data up to the bolo core. But when the process that may exhaust open files is the bolo core, you run into the problem of not being able to submit the data. No graph. No alert. No visibility.

The logical course of action then, is to teach bolo how to count up its open file descriptors and track it internally, without needing a socket or pipe.

Obviously, exec-ing the process collector is out; in the worst case, we have no file descriptors for the pipe to the child.

Looping over /proc/$$/fd is similarly infeasible; opendir(3) needs a descriptor.
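To see why directory walking is off the table, here is a minimal sketch that deliberately runs a process out of descriptors and then tries opendir(3). The helper name opendir_errno_when_exhausted and the DEMO_LIMIT value are made up for this illustration, and it assumes /dev/null can be opened.

```c
#include <dirent.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/resource.h>
#include <unistd.h>

#define DEMO_LIMIT 64   /* arbitrary small soft limit, for the demo only */

/* Exhaust descriptors, try opendir("/"), then clean up.
 * Returns the errno opendir(3) failed with, 0 if it unexpectedly
 * succeeded, or -1 if the limits could not be read or changed. */
int opendir_errno_when_exhausted(void)
{
    struct rlimit saved, low;
    int held[DEMO_LIMIT];
    int n = 0, err = 0;

    if (getrlimit(RLIMIT_NOFILE, &saved) != 0)
        return -1;

    /* Lower the soft limit so exhaustion is quick. */
    low = saved;
    if (low.rlim_cur > DEMO_LIMIT)
        low.rlim_cur = DEMO_LIMIT;
    if (setrlimit(RLIMIT_NOFILE, &low) != 0)
        return -1;

    /* Hold descriptors until open(2) refuses to hand out more. */
    while (n < DEMO_LIMIT) {
        int fd = open("/dev/null", O_RDONLY);
        if (fd < 0)
            break;
        held[n++] = fd;
    }

    /* opendir(3) needs a descriptor of its own, so it should fail here. */
    DIR *d = opendir("/");
    if (d)
        closedir(d);
    else
        err = errno;

    /* Release what we held and restore the original soft limit. */
    while (n > 0)
        close(held[--n]);
    setrlimit(RLIMIT_NOFILE, &saved);

    return err;
}
```

On a typical Linux system the value returned should be EMFILE, the same error every other descriptor-hungry call gets once the table is full.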

Luckily, an unlikely combination of getrlimit(2) and poll(2) does the trick nicely.

To The Limit

Under POSIX-compliant operating systems (Linux, *BSD, OS X, etc.) all process resource usage is constrained by quotas or limits. For example, there is a limit to the number of pending signals a process can have, and another limit for stack size. But the most well-known of these is the number of open files limit, nofile.

To see your shell's nofile limit, run ulimit -n:

$ ulimit -n
1024

There are actually two limit values for each type, a soft limit and a hard limit. These are also referred to as the current limit and the maximum limit, because a process can elect to increase its soft (or current) limit up to, but no further than, its hard (or maximum) limit.

The ulimit shell builtin can show you the soft or hard limit:

$ ulimit -Sn
1024

$ ulimit -Hn
4096

Here, my shell can have up to 1024 files open at once (the current / soft limit). If I want to, I can increase this limit up to 4096 files. 4097, as they say, is right out.
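The same soft-to-hard bump can be done from inside a process with setrlimit(2). The following is a small sketch; the function name raise_nofile_to_hard is invented for this example, and some platforms may reject the request when the hard limit is unlimited.

```c
#include <sys/resource.h>

/* Raise the soft nofile limit to the hard limit.
 * Returns 0 on success, -1 on failure (errno is set by the
 * failing getrlimit/setrlimit call). */
int raise_nofile_to_hard(void)
{
    struct rlimit lim;

    if (getrlimit(RLIMIT_NOFILE, &lim) != 0)
        return -1;

    /* An unprivileged process may go up to, but not past, rlim_max. */
    lim.rlim_cur = lim.rlim_max;
    return setrlimit(RLIMIT_NOFILE, &lim);
}
```

Going the other way (lowering the soft limit) uses the same call, which is handy for testing descriptor exhaustion without opening tens of thousands of files.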

The getrlimit(2) system call lets us programmatically determine our limits from inside of C.

#include <stdio.h>
#include <sys/time.h>
#include <sys/resource.h>

int main(int argc, char **argv)
{
    struct rlimit lim;
    if (getrlimit(RLIMIT_NOFILE, &lim) != 0)
        return 1; /* bail! */

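    /* rlim_t is not guaranteed to be unsigned long; cast to match %lu */
    printf("soft = %lu; hard = %lu\n",
           (unsigned long)lim.rlim_cur, (unsigned long)lim.rlim_max);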
    return 0;
}

So now, bolo can definitely keep track of its limits. Bo-ring.

Poll Dancing

The poll(2) system call is almost exclusively used for I/O multiplexing. You've got a bunch of file descriptors, and you want to pick from the first available for reading and/or writing. The kernel knows all and sees all, so if you ask nicely (and allocate your pollfd structures properly) it will block until any of them are interesting enough to deal with.

So what does that have to do with tallying the open file descriptors?

As it turns out, if you tell poll(2) to watch a file descriptor that isn't open, you get back an output event of POLLNVAL, indicating that this descriptor is invalid, i.e. not a real descriptor.

Don't believe me? Try this out:

#include <stdio.h>
#include <poll.h>
#include <unistd.h>            /* for close(2) */

int main(int argc, char **argv)
{
    struct pollfd fds[1];

    fds[0].fd     = 2; /* standard error */
    fds[0].events = 0; /* more on this later */

    close(2);

    if (poll(fds, 1, 0) < 0)
        return 1; /* bail! */

    if (fds[0].revents & POLLNVAL)
        printf("fd 2 is not a real file descriptor\n");
    else
        printf("fd 2 *is* a real file descriptor (bug...)\n");

    return 0;
}

There's a lot going on here, but we'll highlight the important bits.

The 3rd argument to poll(2) is 0. This is the timeout parameter, and it specifies the maximum number of milliseconds the kernel will block before returning to the caller. If we were doing normal I/O multiplexing, this value would be -1 (block forever) or some useful value (like 2000ms). Passing 0 causes poll(2) to return immediately, even if none of the file descriptors are ready, which dovetails with the next point...

The pollfd.events attribute is also 0. The events attribute lets us specify whether we are interested in knowing when a file descriptor becomes readable, writable, etc. Setting it to 0 means "I don't care one whit about the condition of the file descriptor". That is fine, because error flags like POLLNVAL are reported in revents whether or not we ask for them. Remember, we are only interested in that POLLNVAL error flag.

We close standard error (fd 2). The example doesn't work otherwise. The file descriptor has to be closed for poll(2) to flag us that it is invalid.

Even though we didn't specify any events, we still get revents. This is the crux of the whole hack: after poll(2) returns immediately, we can see that file descriptor 2 is no longer a valid file descriptor.
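Boiled down to a reusable helper, the trick might look like the sketch below. The name fd_is_open is hypothetical. Note the explicit check for negative numbers: poll(2) skips negative fd values entirely and leaves revents at 0, which would otherwise look like "open".

```c
#include <poll.h>

/* Returns 1 if fd is an open descriptor, 0 if it is not,
 * and -1 if poll(2) itself failed. */
int fd_is_open(int fd)
{
    struct pollfd p;

    if (fd < 0)
        return 0;          /* poll(2) ignores negative fds; never valid */

    p.fd      = fd;
    p.events  = 0;         /* no events requested...                  */
    p.revents = 0;         /* ...but POLLNVAL is reported regardless  */

    if (poll(&p, 1, 0) < 0)
        return -1;

    return (p.revents & POLLNVAL) ? 0 : 1;
}
```

Calling fd_is_open() on one end of a freshly created pipe should return 1, and 0 again once that end has been closed.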

When A Plan Comes Together

Now we are armed with the following critical techniques:

We know how to get the maximum number of file descriptors (via getrlimit(2)).
We can determine if an arbitrary file descriptor is valid or not (via poll(2) and POLLNVAL).

What if we were to call poll(2) on all of the possible file descriptors? Since we know what the limit is, and file descriptors are just small non-negative integers (per POSIX) handed out below that limit, we can do that:

#include <stdio.h>
#include <stdlib.h>            /* for calloc(3) */
#include <sys/time.h>
#include <sys/resource.h>
#include <poll.h>

int main(int argc, char **argv)
{
    struct rlimit lim;

    if (getrlimit(RLIMIT_NOFILE, &lim) != 0)
        return 1; /* bail! */

    struct pollfd *fds = calloc(lim.rlim_cur, sizeof(struct pollfd));
    if (!fds)
        return 1; /* bail! */

    rlim_t i; /* same type as the limit; avoids a signed/unsigned comparison */
    for (i = 0; i < lim.rlim_cur; i++) {
        fds[i].fd     = (int)i;
        fds[i].events = 0;
    }

    if (poll(fds, lim.rlim_cur, 0) < 0)
        return 1; /* bail! */

    unsigned long nfds = lim.rlim_cur;
    for (i = 0; i < lim.rlim_cur; i++)
        if (fds[i].revents & POLLNVAL)
            nfds--; /* not a real fd, discount it */

    printf("%lu open files\n", nfds);
    free(fds);
    return 0;
}

Finally, a single poll(2) call that can reliably (and quickly) determine how many open file descriptors the current process has. Let's do the breakdown, shall we?

We start with the nofile limit. This is the soft limit, because the hard limit has no bearing on the number of open file descriptors.

Then, allocate enough pollfd structures to represent all of the possible file descriptors. As before, we set the events attribute to 0, because we only care about the POLLNVAL error condition.

Counting is kind of backwards, but it works! Instead of starting at 0 and counting up, I chose to start at $max and decrement the count for every invalid file descriptor. You can do it either way.

No new file descriptors were allocated. In an earlier (and less successful) attempt, I employed epoll(7), thinking that it would be able to handle a larger number of file descriptors. Unfortunately, as the man page states:

epoll_create() returns a file descriptor referring to the new epoll instance.

So that won't work when we're completely out of file descriptors...

To A More Ruggedized Implementation

For my immediate needs (making bolo able to track its own file descriptor usage and alert upon exhaustion) I ended up writing a function called open_files that returns both resource limits alongside the number of open fds, via output parameters.
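The real open_files is not reproduced in this post. Purely as a sketch of what a function with that shape could look like, here is one possibility; the name open_files_sketch, its parameter types, and its 0 / -1 return convention are assumptions made for illustration, not the bolo code.

```c
#include <limits.h>
#include <poll.h>
#include <stdlib.h>
#include <sys/resource.h>

/* Hypothetical sketch: report the soft limit, the hard limit, and the
 * number of open descriptors below the soft limit via output parameters.
 * Returns 0 on success, -1 on failure. */
int open_files_sketch(rlim_t *soft, rlim_t *hard, rlim_t *used)
{
    struct rlimit lim;
    struct pollfd *fds;
    rlim_t i, count = 0;

    if (getrlimit(RLIMIT_NOFILE, &lim) != 0)
        return -1;

    /* An unlimited (or absurdly large) soft limit cannot be scanned. */
    if (lim.rlim_cur == RLIM_INFINITY || lim.rlim_cur > INT_MAX)
        return -1;

    /* calloc(0, ...) may return NULL, so always ask for at least one. */
    fds = calloc(lim.rlim_cur ? (size_t)lim.rlim_cur : 1, sizeof *fds);
    if (!fds)
        return -1;

    for (i = 0; i < lim.rlim_cur; i++) {
        fds[i].fd     = (int)i;
        fds[i].events = 0;   /* only POLLNVAL matters */
    }

    if (poll(fds, (nfds_t)lim.rlim_cur, 0) < 0) {
        free(fds);
        return -1;
    }

    /* Count up this time: every fd not flagged POLLNVAL is open. */
    for (i = 0; i < lim.rlim_cur; i++)
        if (!(fds[i].revents & POLLNVAL))
            count++;

    free(fds);
    *soft = lim.rlim_cur;
    *hard = lim.rlim_max;
    *used = count;
    return 0;
}
```

A caller could compare used against soft and raise an alert once the ratio crosses some threshold. A long-running daemon might also prefer to allocate the pollfd array once at startup rather than on every check, so the count never depends on memory being available at the worst possible moment.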

The code can be found here.

James (@iamjameshunt) works on the Internet, spends his weekends developing new and interesting bits of software and his nights trying to make sense of research papers.

Currently exploring Kubernetes, as both a floor wax and a dessert topping.