maps, smaps and Memory Stats!

I work in monitoring, and one of the things that I have to deal with is process metrics. People care (or at least should care) about all kinds of things related to their processes. Is it running? Is there only one? How many open files does each have? How many threads are in the parent process?

How much memory is the process (and its children) using?

That last one is one of the trickiest, owing in no small part to the sophistication of modern memory management systems.

But top shows that, right? Kinda.

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
22522 jrhunt    20   0 2305908 100820  31824 S   4.6  1.7   0:26.07 banshee
 2760 root      30  10  532132  58300   1584 S   3.3  1.0  96:25.44 vdo-partial-upgr
 2889 jrhunt    20   0 1809616 322004  24348 S   3.3  5.4 231:43.15 compiz
25642 jrhunt    20   0 2171372 668276  67440 S   1.7 11.1 153:55.94 firefox
 1392 root      20   0  435612  96008  61960 S   1.3  1.6 104:41.43 Xorg
 2852 jrhunt    20   0  437844   6448   3612 S   1.0  0.1  19:55.79 pulseaudio
 2989 jrhunt    20   0  408736  11316   7912 S   1.0  0.2  72:32.86 indicator-multi
  427 root     -51   0       0      0      0 S   0.7  0.0  16:55.50 irq/44-iwlwifi
 2789 jrhunt    20   0  669608  52004  12692 S   0.7  0.9  46:39.71 unity-panel-ser
  907 root      20   0       0      0      0 S   0.3  0.0   7:35.90 rts5139-polling

Easy as pi. Here, banshee is using 2.3G of virtual memory and 100M of RAM, right?

Hardly.

Unless banshee is statically compiled, it's sharing memory with other processes that are using the same dynamic libraries. To see exactly what ranges of the process address space are mapped to what object files / memory-mapped files, look at /proc/$$/maps:

... snip ...
7fd65ccf1000-7fd65ccf2000 rw-p 00105000 08:02 1845997  /lib/x86_64-linux-gnu/libm-2.19.so
7fd65ccf2000-7fd65cd15000 r-xp 00000000 08:02 1845995  /lib/x86_64-linux-gnu/ld-2.19.so
7fd65cd15000-7fd65cd16000 r--s 00000000 08:02 5359968  /var/cache/fontconfig/1ac9eb803944fde146138c791f5cc56a-le64.cache-4
7fd65cd16000-7fd65cd1a000 r--s 00000000 08:02 5251626  /var/cache/fontconfig/4d6aee6d44eccb37054d3216e945f618-le64.cache-4
7fd65cd1a000-7fd65cd29000 r--p 00000000 08:02 1576851  /usr/lib/mono/gac/Mono.Cairo/4.0.0.0__0738eb9f132ed756/Mono.Cairo.dll
7fd65cd29000-7fd65cd5c000 r--p 00000000 08:02 3545527  /usr/lib/banshee/Hyena.dll
7fd65cd5c000-7fd65cd7a000 r--p 00000000 08:02 832305   /usr/lib/mono/gac/dbus-sharp/1.0.0.0__5675b0c3093115b5/dbus-sharp.dll
7fd65cd7a000-7fd65cd90000 r--p 00000000 08:02 832327   /usr/lib/mono/gac/glib-sharp/2.12.0.0__35e10195dab3c99f/glib-sharp.dll
7fd65cd90000-7fd65cda0000 rw-p 00000000 00:00 0
7fd65cda0000-7fd65cda2000 r--s 00000000 08:02 5358353  /var/cache/fontconfig/767a8244fc0220cfb567a839d0392e0b-le64.cache-4
7fd65cda2000-7fd65cda7000 r--s 00000000 08:02 5353921  /var/cache/fontconfig/7ef2298fde41cc6eeb7af42e48b7d293-le64.cache-4
7fd65cda7000-7fd65cdb0000 r--p 00000000 08:02 3545605  /usr/lib/banshee/Extensions/Banshee.Fixup.dll
7fd65cdb0000-7fd65cdc0000 rw-p 00000000 00:00 0
... snip ...

Since the ld-2.19.so library is mapped in r-xp mode — i.e. not writable — it's a safe bet that anything else mapping that library is sharing pages with us. Even libm-2.19.so, which is writable, is probably also sharing pages, since the p means private with copy-on-write semantics.

This is noteworthy. If the process does write to the libm's memory region, the kernel will craftily allocate a new page, copy the original data to it, and present that to the process. This new page will belong exclusively to us, and we should count that page towards our overall process memory footprint, while still discounting the untouched pages that are still shared.

Enter /proc/$$/smaps.

A coworker showed me this file, and its exhaustive accounting of memory mappings that makes the maps file look like a a tweet from the kernel (#proc #systemstats #YOLO)

7fd65ccf1000-7fd65ccf2000 rw-p 00105000 08:02 1845997  /lib/x86_64-linux-gnu/libm-2.19.so
Size:                  4 kB
Rss:                   4 kB
Pss:                   4 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:         4 kB
Referenced:            4 kB
Anonymous:             4 kB
AnonHugePages:         0 kB
Swap:                  0 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Locked:                0 kB
VmFlags: rd wr mr mw me ac sd

7fd65ccf2000-7fd65cd15000 r-xp 00000000 08:02 1845995  /lib/x86_64-linux-gnu/ld-2.19.so
Size:                140 kB
Rss:                 124 kB
Pss:                   1 kB
Shared_Clean:        124 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:         0 kB
Referenced:          124 kB
Anonymous:             0 kB
AnonHugePages:         0 kB
Swap:                  0 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Locked:                0 kB
VmFlags: rd ex mr mw me dw sd

Yeah, that's way more information than we're used to from the old maps file. But what does it all mean? You can get some of the basics from proc(5), but my local copy didn't even mention the Pss column.

Why not experiment?

(I'm glad you asked. What follows is a safari into the uncharted jungle of memory management, aided by our trusty friend Exploratory Programming in C!)

Exploratory Programming in C

I love exploratory programming. Home directories on hundreds of servers I've managed over the years are littered with files named t.pl or x.sh. What better way to verify a language or OS feature than to just knock out a quick script and see what happens?

Naturally, when confronted with the compelling but coy smaps file, my first thought was to write a small test script. But Perl and Bash just would not do — we need something close to the machine. Ah yes, C. For the level of control over memory, and its type (heap-allocated, static, stack-allocated, etc.), C is the easiest and best choice.

For those of you who (like me) have to get your hands into things, I've posted all of the code, along with a Makefile, to github.

Some Helper Code

To make this easier, I've written a helper script and some convenience C functions; these are intended to make it easier to see both what is going on with the memory mappings, and to distill the test cases down to their base essentials.

Establishing a Baseline

Everything has overhead, even close-to-bare-metal C programs. Before we start experimenting, let's establish a baseline with the null program:

#include "lib.c"

int main(int argc, char **argv)
{
    return interlude();
}

Why yes, I am including another C source file (lib.c) — that's not a typo. Doing it this way, instead of as another translation unit, makes the build process simpler; just issue a make null to build null.c!

Here's a sample run:

$ ./null
pid 11091
------------------------------------------
go check /proc/11091/smaps; I'll wait...
press <Enter> when you're done

With that running in one console, here's what the diag script gives us:

$ ./diag 11091
[mmap]:
  private      4.0 k [clean]      48.0 k [dirty]
   shared        -   [clean]         -   [dirty]

[stack]:
  private        -   [clean]      16.0 k [dirty]
   shared        -   [clean]         -   [dirty]

Without actually doing anything, we've allocated 16 kilobytes of stack and 52 kB of mmapped memory, 4 kB clean and the remaining 48 kB dirty. That memory is probably brought in by libc (the only library we've linked with).

Static Buffers / The Data Segment

For our first experiment, let's see what happens when we allocate memory statically:

#include "lib.c"

static char buf[32 MB] = {0};
int main(int argc, char **argv)
{
    randomize(buf, 32 MB);
    return interlude();
}

(check lib.c for the definition of the MB macro)

The 32 MB buffer is static; which means that 32 megabytes of buffer space will be reserved in the .data segment of the ELF image. This should show up against the [mmap] section, because the binary image is memory-mapped into the address space of the running process.

[mmap]:
  private      4.0 k [clean]     <b> 32.0 M </b>[dirty]
   shared        -   [clean]         -   [dirty]

[stack]:
  private        -   [clean]      16.0 k [dirty]
   shared        -   [clean]         -   [dirty]

Sure enough, there's 32M of dirty private memory. If you recall our null test earlier, the stack still has 16 kB allocated, and we see our 4 kB of private clean mmap'd memory.

Allocating On The Stack

Static allocation is boring. What happens if we allocated on the stack?

#include "lib.c"

int main (int argc, char **argv)
{
    char buf[28 KB] = {0};
    randomize(buf, 28 KB);
    return interlude();
}

(again, look at lib.c for the KB macro definition)

Simple enough; allocate 28 kilobytes on the activation record for main(). Starting from our null baseline of 16 kB stack, we should expect the stack to increase by 28 kB to 44 kB, all of it private and marked as dirty.

Here's what ./diag has to say:

[stack]:
  private        -   [clean]      44.0 k [dirty]
   shared        -   [clean]         -   [dirty]

Heap Heap Heap

The heap, or free store, is an area filled with blocks of memory (often of different sizes) where calls like malloc, calloc, realloc, and friends play. Let's see if we can do some heap allocations and see what smaps does.

#include "lib.c"

int main(int argc, char **argv)
{
    dirty(16 MB);
    clean(32 MB);
    return interlude();
}

Here we've allocated a total of 48 mb on the heap; 16 mb will be changed, and we'll leave 32 mb untouched. As we'll see, however, that doesn't matter for heap accounting - when you get it, it's in RAM whether or not you use it.

[heap]:
  private        -   [clean]      48.8 M [dirty]
   shared        -   [clean]         -   [dirty]

See?

If you look closely, you'll see we actually have 48.8 mb in our heap mapping. This is overhead of our malloc implementation. At some point, the breakpoint had to be moved to accomodate our memory requests, via sbrk(2). Rather than request the exact amount required, the glibc malloc implementation requests a bit extra, to handle future requests.

Inherited Memory and fork(2)

When a process forks, it becomes two processes, executing the same code, with the same memory segments. Older UNIX implementations naively copied every single page of memory from the parent to the child, and they resumed execution with completely different memory.

Turns out, the normal use case for fork(2) is to follow it immediately with an execve(2) (or a variant) int the child process, effectively blowing away all that carefully copied memory and starting over with a new program image.

BSD initially solved this by punting the optimization back to the programmer; they introduced the vfork(2) system call that skipped all that memory copying. Linux (and eventually all the other modern UNIXes, including BSD) opted for a thing called copy-on-write, or CoW.

Copy-on-write semantics work like this: child address space is mapped to the same backing pages (RAM) as the parent, except that when the child attempts to write to one of those pages, the kernel transparently copies the memory contents to a new, dedicated page, before carrying out the write.

This speeds up fork+exec considerably; no more time wasted copying all that memory. It also speeds up fork in other scenarios, especially when the child ignores most of its parent's memory space.

The downside: it really complicates memory usage analysis.

Let's say you want to know how much memory your Apache web server is using, including the master listener process and all of the child worker processes. (if you don't like Apache, substitute your favorite multi-process system).

Clearly, you need to know how much heap / stack / mmap'd memory the parent process is running, and we now know how to account for that. Looking at the child processes brings up a different set of problems, and brings us to the distinction of private memory vs. shared memory.

But first, an example:

#include "lib.c"

int main(int argc, char **argv)
{
    char *buf = malloc(64 KB);
    randomize(buf, 64 KB);
    interlude();

    pid_t pid = fork();
    assert(pid >= 0);
    if (pid == 0) {
        randomize(buf, 16 KB);
        return interlude();
    }

    int st;
    waitpid(pid, &st, 0);
    return st;
}

The parent allocates 64 kB on the heap, and then forks a child process which will update 16 kB of that buffer. This should trigger our copy-on-write semantics nicely.

You'll notice we call interlude() twice; once right after the parent allocations, but before the fork, and again at the end. This lets us inspect the state of the parent before the fork:

[heap]:
  private        -   [clean]      68.0 k [dirty]
   shared        -   [clean]         -   [dirty]

(I've left off the [stack] section, as it has no bearing on CoW semantics)

As expected, there's 64 kB of heap usage (plus 4 kB for malloc overhead). Pressing <enter> at the interlude prompt will continue on with the fork and subsequent child shenanigans, and then we can inspect the parent process again:

[heap]:
  private        -   [clean]      20.0 k [dirty]
   shared        -   [clean]      48.0 k [dirty]

After the fork, the parent is now sharing 48 kB with the child process, which corresponds to the parts of buf that have not yet been overwritten by the child process. 20 kB of heap (really, 4 kB of overhead and 16 kB of buf) is now marked as private.

Let's look at the child:

[heap]:
  private        -   [clean]      20.0 k [dirty]
   shared        -   [clean]      48.0 k [dirty]

It looks identical, because it's sharing the same amount of memory with the parent (those last 48 kB of buf), and has its own copy of heap overhead (4 kB) and the 16 kB that it overwrote.

Fun With mmap(2)

So far we've looked at static allocation, stack allocation and heap allocation. Now, let's turn to the fourth allocation discipline, mmap(2)

The only system call in our line-up, mmap(2) creates a new mapping in the process address space. It is directly responsible for the creation of a new section in our smaps file. These mappings can be file-backed or anonymous. They can be marked readable, writeable, executable or a combination thereof. They can be reserved for private use, or shared between processes.

Here's mmap.c, an experiment to see what happens with each type of mapping:

#include "lib.c"

int main(int argc, char **argv)
{
    /* inert map (never modified) */
    char *inert = mmap(NULL, 16 KB,
        PROT_READ|PROT_WRITE,
        MAP_ANONYMOUS|MAP_PRIVATE,
        -1, 0);

    /* anonymous, private mmap */
    char *anon_priv = mmap(NULL, 32 KB,
        PROT_READ|PROT_WRITE,
        MAP_ANONYMOUS|MAP_PRIVATE,
        -1, 0);
    randomize(anon_priv, 32 KB);

    /* anonymous, shared map */
    char *anon_shared = mmap(NULL, 64 KB,
        PROT_READ|PROT_WRITE,
        MAP_ANONYMOUS|MAP_SHARED,
        -1, 0);
    randomize(anon_shared, 64 KB);

    /* private, file-backed map */
    int fd = open("data/256k", O_RDWR);
    assert(fd >= 0);
    char *file = mmap(NULL, 256 KB,
        PROT_READ|PROT_WRITE,
        MAP_PRIVATE,
        fd, 0);
    randomize(file, 128 KB);

    return interlude();
}

The code is long, because of the line-wraps, but it's not too complicated.

First, we create a new anonymous private map of 16 kB, which we never use. Anonymous maps can only be backed by physical RAM (or swap space), because they have no file backing. The Linux kernel will set up the address range, but doesn't actually allocate any physical RAM, since we haven't tried to write to any address in the new range.

Second, we create another anonymous private map. This one will be 32 kB, and we will write to all of it; forcing the kernel to provide pages in RAM for us.

Then, we create a 64 kB anonymous map and initialize it with data. This mapping will be marked as shareable, via MAP_SHARED.

Finally, we open a 256 kB file in read/write mode, and map it into memory space. For fun, we'll go ahead and write all over the first half of the file (128 kB), to force the kernel to retrieve pages from disk and stuck them in RAM.

Here's what diag has to say:

[mmap]:
  private        -   [clean]     272.0 k [dirty]
   shared        -   [clean]         -   [dirty]

272 kB works out to exactly 48 + 32 + 64 + 128. Recall that our baseline null program had 48 kB of mmap'd memory, probably from glibc or the C runtime itself. Add the 32 kB anonymous private map, the 64 kB anonymous shared map and half (128 kB) of the file map, and you get 272 kB!

How the kernel handles the file-backed memory map is interesting, and speaks to the power of mmap(2). Since the process only wrote to the first 128 kB of the file, the kernel only had to allocate memory for that section. This also proves that the smaps file only reports pages that are actually in RAM.

Confusing [heap] and [mmap]

So, [heap] is for malloc() and [mmap] is for mmap().

If only it were that simple.

If you check the man page for malloc(3), you'll find this nifty little gem in the NOTES section:

Normally, malloc() allocates memory from the heap, and adjusts the size
of the heap as required, using sbrk(2). When allocating blocks of memory
larger than MMAP_THRESHOLD bytes, the glibc malloc() implementation
allocates the memory as a private anonymous mapping using mmap(2).
MMAP_THRESHOLD is 128 kB by default, but is adjustable using mallopt(3).

So, if you try to heap-allocate more than 128 kB (using the default MMAP_THRESHOLD), malloc() sneakily calls mmap() to do the heavy lifting. Normally this is totally fine and transparent, but it does change the way memory is reported.

To illustrate, here's a test case that allocates two buffers, one under the 128 kB threshold, and one over:

#include "lib.c"

int main(int argc, char **argv)
{
    char *under = malloc(96 KB);
    randomize(under, 96 KB);

    char *over = malloc(256 KB);
    randomize(over, 256 KB);

    return interlude();
}

Normally, we would expect this to report ~352 kB (95 kB + 256 kB) in the [heap] section.

[heap]:
  private        -   [clean]     100.0 k [dirty]
   shared        -   [clean]         -   [dirty]

[mmap]:
  private        -   [clean]     308.0 k [dirty]
   shared        -   [clean]         -   [dirty]

You can plainly see that the 96 kB allocation was done in [heap], while the 256 kB passed the MMAP_THRESHOLD and was mmap'd. The 308 kB number comes from our baseline mmap of 56 kB + 256 kB from the malloc() call.

Putting It All Together

Hopefully you've learned a bit about how memory is organized inside a running Linux process, and how the kernel handles the different allocation methods and process forking.

Here's few parting thoughts on using the data in smaps to your advantage:

  1. Private memory always belongs just to this process
  2. Shared memory may be shared with parent and/or children processes
  3. Sometimes [heap] isn't the full story on heap allocations
  4. Memory is a complicated topic

Happy Hacking!

James (@iamjameshunt) works on the Internet, spends his weekends developing new and interesting bits of software and his nights trying to make sense of research papers.

Currently working on Rook.