A lot of words get written about how Docker revolutionizes the delivery of software. Without a doubt, containers-as-super-packages are phenomenally useful for wrapping up binaries with configuration files, helper programs, shared libraries, and language runtimes. I’m not here to talk about that today.
No, I want to talk about how I use Docker while writing software–and how you can too!
Here’s a Dockerfile that makes my life easier when I’m working in Perl:
FROM ubuntu:20.04
RUN apt-get update \
&& apt-get install -y carton \
&& rm -rf /var/lib/apt/lists
WORKDIR /app
COPY cpanfile .
RUN carton install
# start here when you figure out how to run the Perl-y thing...
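From there, development is just a bind mount away. Here's roughly what that looks like for me (the image tag and script paths are made up; substitute your own — and note that I mount subdirectories rather than all of /app, so the baked-in local/ from carton install doesn't get shadowed):
$ docker build -t perl-dev .
$ docker run -it --rm -v $PWD/lib:/app/lib -v $PWD/bin:/app/bin perl-dev \
    carton exec -- perl bin/app.pl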
Here’s what I do for Common Lisp work:
FROM ubuntu:20.04
RUN apt-get update \
&& apt-get install -y sbcl \
&& rm -rf /var/lib/apt/lists
ADD https://beta.quicklisp.org/quicklisp.lisp /tmp
# start here when you figure out how to run the Lisp-y thing...
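If you'd rather have Quicklisp fully installed in the image, instead of merely downloaded, a line like the following should do it. This is an assumption on my part (the recipe above stops short of it), but quicklisp-quickstart:install is the standard entry point:
RUN sbcl --non-interactive \
         --load /tmp/quicklisp.lisp \
         --eval '(quicklisp-quickstart:install)'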
These “partial images” contain the seeds of a real production deployment, while still letting me incrementally develop the software in the same environment.
It gets even more interesting when we start relying on libraries that are outside the purview of the language ecosystem itself, usually because they are written in C. For example, if I add hunchentoot, a pure-Lisp HTTP server, to my Common Lisp project, it will pull in cl+ssl, for TLS capabilities, which depends on OpenSSL.
To accommodate that, I just have to amend the apt-get install command, from this:
RUN apt-get update \
&& apt-get install -y sbcl \
&& rm -rf /var/lib/apt/lists
to this:
RUN apt-get update \
&& apt-get install -y sbcl libssl-dev \
&& rm -rf /var/lib/apt/lists
Who builds an application that doesn’t handle data? I usually need either a Redis cluster, a PostgreSQL instance, or at least a persistent file system to keep stuff around in between application server reboots. For that, I use Docker Compose and off-the-shelf images.
For example, the other day I was building a content-addressable blob store based on SHA-3 hashing. Blobs live in files under the file system, named after the checksum of their contents. I also needed the ability to group blobs arbitrarily (via tags) and store keyed metadata (file size, original file name, etc.). Metadata lives in a PostgreSQL relational database.
To spin everything up reliably, I wrote a Compose recipe. Here it is:
version: '3'
services:
  db:
    image: postgres
    environment:
      POSTGRES_PASSWORD: sekrit
    volumes:
      - ./_/pg:/var/lib/postgresql/data
      - ./schema.sql:/docker-entrypoint-initdb.d/10-schema.sql
  api:
    build:
      context: .
    environment:
      DB_HOST: db
    volumes:
      - ./_/store:/store
      - ./api:/app:ro
    ports:
      - 7000:7000
I get two things out of this: (1) a persistent database, and (2) a place to execute code in a REPL (this is a Common Lisp project). Should I want to start over, I can drop the containers, wipe the persistent storage directory (_), and start it back up again:
$ docker-compose down
$ sudo rm -rf _
$ docker-compose up -d
A major advantage Lisps have over other languages is the utility of the REPL–an interactive top-level expression evaluator. With a spinning api container, I can access the REPL and play around with code as it evolves:
$ docker exec -it the_api_1 sbcl
This is SBCL 2.0.1.debian, an implementation of ANSI Common Lisp.
More information about SBCL is available at <http://www.sbcl.org/>.
SBCL is free software, provided as is, with absolutely no warranty.
It is mostly in the public domain; some portions are provided under
BSD-style licenses. See the CREDITS and COPYING files in the
distribution for more information.
* (+ 1 2 3)
6
This is handy, because it allows me to edit files outside of the container (the ./api:/app volume mount does that), while still relying on in situ execution, evaluation, and exploration.
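For instance, after tweaking a file on the host side, I can just re-load it in the running REPL without restarting anything (the file name here is hypothetical):
* (load "/app/main.lisp")
T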
One of my earliest memories of computers starts on an airplane.
When I was little, my mother and I would head down to South Florida for vacation every summer. She had friends who lived down there and they could spare us a room for the week. It was a change in scenery that didn’t affect the pocketbook too dearly.
Often we drove, two days spent sitting in the cab of a 1989 Ford Ranger pickup. We would wind our way through Illinois, to Tennessee and Kentucky, stopping in Georgia to visit with distant relatives — my grandmother’s brothers, I was told. It was a long trip for a kid, back before the Internet, iPhones, Netflix, and screens everywhere you looked.
I entertained myself by reading maps. Maps printed on paper, in a bewildering array of bright, vivid colors. Numbers and letters and names. Borders and boundaries and routes. We played license plate bingo. We counted telephone poles.
One year, however, we had saved up enough money, and we flew. It was amazing. The seats were comfortable, the flight attendants brought us peanuts and soda, and the cabin was air-conditioned. Truly, this was the only way to travel.
The plane was huge, and our tickets were discount, fill-the-flight affairs. I was not seated next to my mother. Instead, I sat next to a middle-aged businessman traveling cross-country with a most ingenious device: a laptop computer.
This was probably 1993. The laptop was huge by today’s standards, and it ran Windows 3.1 (though I had no notion of that at the time). It had a word processor, which the nice gentleman was studiously plunking words into, pulling together some sort of business report or memorandum. I watched, spellbound.
At one point he must have noticed me watching him, because he turned and asked me if I had ever seen a computer before. I shook my head, half-embarrassed and speechless. He turned the laptop so that it partially faced me and started showing me all the neat things it could do.
He pulled up a clip art library from within the word processor – an endless collection of garish pixelated artwork that looked so modern and futuristic to me then. He opened a drawing application (MS Paint, I’m sure of it now) and let me doodle, fumbling with the mid-keyboard mouse analog that passed for a “pointer device” in those days.
For the first half of the flight from Chicago to Fort Lauderdale, I asked question after question after question. How did the computer know how to draw? Who made all the pictures in the clip art library? Why were the letters on the keyboard all jumbled up? How did the little red nub in between the “B” and the “N” ever come by the name of “mouse”? Could the writing app write by itself? How did the computer work without an extension cord? Question after question after question.
Eventually, the nice man with the computer called over the flight attendant and asked to be seated elsewhere. But I never quite got over the possibilities of that little bit of technology, five thousand feet above ground.
Today, I start a new chapter in my career as a technologist and problem-solver, as I officially join Vivanti Consulting as a founding member and Principal Consultant.
In honor of the occasion, I want to tell you a story about one of my earliest consulting screw-ups.
In the Fall of 2015, I was coming off of my first consulting engagement with a new firm to go spin up a brand new opportunity as a lead. This was going to be my moment to shine. I was scared, optimistic, and exhilarated all at once. For clarity’s sake, let’s call this client FlipChord, although that is not their name.
Things went well for the first month or so. We demonstrated that our team’s grasp on the chosen technology stack was both deep and broad. We helped them spin up some impressive automation. We even pitched in and helped them re-target some of their deep-down containerization code, to use a different runtime. You see, FlipChord had determined that to succeed, they needed to build their application deployment platform — a PaaS of sorts — in-house, from scratch.
This caused quite the stir internally. Senior technical staff (all consultants) were concerned that FlipChord was just reinventing the wheel. Management was concerned that FlipChord would fail as a result, and we would be to blame. Both groups started asking questions.
It is important to note that FlipChord was quite satisfied with our performance thus far on the engagement. We were hitting our milestones. They were delivering their platform. Their software teams were positively delighted. They certainly hadn’t considered it a failure.
Until we intervened.
In January, I was invited to a Zoom meeting that included my higher-ups and the complete management chain at FlipChord. Some of the names on the invite were only academically familiar to me, in the way you would recognize a person because of their parking spot placard, or the fact that their name is on the building. The hairs on the back of my neck started to stand on end as I read through the seemingly endless list of invitees. I had a bad feeling about this intervention.
The meeting was scheduled for an hour, but we didn’t use more than ten minutes.
First, my management made their case. That FlipChord was doing things wrong. That they were bound to fail. That we knew what was correct, and we would be happy to help them implement it.
Then, silence.
I will never forget that awful silence. No one was muted, but they might as well have been. My face grew hot. My mouth dried up. I could taste the shame and the embarrassment.
We should not have done this.
After an awkward silence, FlipChord management asked a few questions, none of which I can remember. They then thanked us, and informed us that they would be exercising the termination clause of our contract.
I was devastated.
I tell this story to people not so that they will pity me, but to understand two very important factors in consulting: empowerment and championing.
Empowerment is what happens when you let people make their own decisions. It’s a devilishly simple thing to do. As a leader, when you hand someone a job, you give them the authority, the autonomy, and the equipment to do it. And then you get out of the way. You tell them what is to be done, but never how to do it.
Championing describes the complex relationship a consultant has with their client. It’s a relationship quite unlike the employer / employee relationship most of the workforce is accustomed to. A consultant works for the firm, but champions the client. The firm pays you, but the client retains the firm, based on you. Your work. Your advice. Your assistance.
The consultant’s sole goal is to make the client successful, in the ways the client wants to succeed. Opinions are offered during decision-making, sure. But once the course is selected and the approach identified, the consultant does everything they can to get the client there. Anything less is a disservice to the client, and cheapens the trade of consulting.
(If you’re a lawyer, or a financial planner with fiduciary responsibility, you understand this concept intrinsically. I pray that technology consulting gets to where you are, some day.)
With FlipChord, management’s intervention destroyed my empowerment. Insistence on “the correct way” was about as far from championing as you can get. I personally failed the client by not pushing back; by not standing up for their interests. I should have stopped that meeting from happening. I should have championed them.
In the end, we lost the gig, but more importantly, we lost the opportunity to help FlipChord in the future. We lost their trust.
And nobody felt good about that.
At Vivanti, my co-founders and I are trying to build a better consultancy. One based on empowerment, and steeped in this idea that a consultant’s first responsibility is to champion the client.
We’re hiring. Come join us.
I’ve been fascinated by Lisp ever since I first discovered it and it “clicked” for me.
(defun hello (world)
  (format t "Hello, ~a!~%" world))
When I talk about Lisp, I get animated. It unnerves a lot of people, I think. I can see it in their eyes, that slight sense of unease. Is this guy okay? I mean, it’s just a freakin’ programming language!
That’s where I’d probably differ. It’s not just a programming language. It’s a tool for structured, unstructured thought. To understand what Lisp is, we first have to look at all the things people hate about it.
The first thing people notice about Lisp is that it has an awful lot of parentheses. To programmers more accustomed to Algol-like languages (C, Java, Perl, Javascript and the like), all this punctuation seems unnecessary and off-putting.
But parentheses are, for most people writing most programs, the end of the punctuation. Lisp doesn’t have semicolons. It doesn’t really have mathematical symbols — at least not in the sense that you have to remember special rules for how to use them. It doesn’t have block bracing punctuation either.
Just parentheses.
Some Lisp dialects inject new types of parentheses; Clojure has three distinct sets – (), [], and {}! While these provide some visual flavor, they don’t provide anything new, structurally speaking, over plain old round parentheses.
Some (now extinct?) dialects of Lisp had the MEGA END BRACKET OF BRACKETY COMPLETENESS, also known by the more unassuming moniker of “]”. When encountering a lone close-square-bracket, these Lisp compilers would dutifully close up all of the open parentheses. This was believed to save beleaguered programmers from having to finish everything that they start.
(defun should-equal (msg got want)
  (should msg (equal got want))
  (cond ((not (equal got want))
         (diag "got ~S~%" got)
         (diag "wanted ~S~%" want] ; i.e. "))))"
The parentheses don’t distract me. They are not a nuisance. They are the whole point.
You see, Lisp lacks syntax. All those parentheses are there to shore up the structure of the program, since we don’t have a syntax to do it. That probably makes no sense, and that’s okay. Let’s look at a language that does have syntax: Perl.
# here's some Perl
if ($a > $b) {
    print "a!";
} else {
    print "b!";
}
This program, while too trivial to be useful, serves our purposes beautifully. It compares two variables (presumably, numerically) and prints out the name of the variable that is the bigger of the two.
The perl compiler doesn’t actually deal with this syntax; as quickly as it can, it converts the source program text into an abstract syntax tree. Ours looks something like this:
As the program is compiled (or interpreted, in Perl’s case), this abstract syntax tree is expanded, collapsed, coalesced, and cajoled as various bits of the programming language’s implementation are brought to bear on the problem: what the heck did the programmer mean?
In Lisp, we would write the following:
(if (> a b)
    (print "a!")
    (print "b!"))
This is literally just a serialization of the abstract syntax tree!
That’s why I keep saying that Lisp doesn’t have syntax (but it does have parentheses!)
Most Algies like to use infix notation for mathematical operators (2 + 2 = 4, just like you were taught in school), a mix of prefix and postfix notation for increment/decrement operations (++n++, anyone?), prefix for address shenanigans (&val, *ptr), and a weird infix+prefix notation for function calls.
Lisp insists on prefix notation for all of this. This gives lots of people the willies.
Admittedly, prefix notation does take some getting used to. Here are some exercises to get you going.
1 + 2 → (+ 1 2)
++n → (1+ n)
sqrt (25) → (sqrt 25)
Lisp’s use of this somewhat uncomfortable notation is intentional. In Lisp, everything is a function (or a macro, but we’ll get to those in a moment). That includes the arithmetic addition operation.
More interestingly, all functions in Lisp can be applied, with their arguments assembled at runtime by the program doing the calling.
(defun range (a b)
  (loop for i from a to b collect i))

(apply #'+ (range 1 100))
This is easy enough to implement in C:
int sum_range(int from, int to) {
    int i, sum = 0;
    for (i = from; i <= to; i++) {
        sum += i;
    }
    return sum;
}
But how do you do that to just the odd numbers? In Lisp, we can filter the arguments before applying them to the + operation:
(apply #'+ (remove-if #'evenp (range 1 100)))
But in C, we have to build a whole new function!
int sum_odd_range(int from, int to) {
    int i, sum = 0;
    for (i = from; i <= to; i++) {
        if (i % 2 == 1) {
            sum += i;
        }
    }
    return sum;
}
And what if we want to sum the squares of the odd numbers in our range?
(defun sq (n)
  (* n n))

(apply #'+              ; add up the...
  (mapcar #'sq          ; squares of...
    (remove-if #'evenp  ; odd numbers...
      (range 1 100))))  ; from 1 to 100
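If you want to check the arithmetic at the REPL, these evaluate as follows (all over the range 1..100):
* (apply #'+ (range 1 100))
5050
* (apply #'+ (remove-if #'evenp (range 1 100)))
2500
* (apply #'+ (mapcar #'sq (remove-if #'evenp (range 1 100))))
166650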
I’m not even going to bother with the C implementation, because I’m about to throw you a curve ball – in Lisp we can do this without muddying up the global namespace with a function to multiply two numbers together.
Behold, the lambda!
(apply #'+                      ; add up the...
  (mapcar (lambda (n) (* n n))  ; squares of...
    (remove-if #'evenp          ; odd numbers...
      (range 1 100))))          ; from 1 to 100
You can’t do that in C. You can do it in Go, but it’s a bit less elegant:
func sum(from, to int, keep func(int) bool) int {
    sum := 0
    for i := from; i <= to; i++ {
        if keep(i) {
            sum += i
        }
    }
    return sum
}
sum(1, 100, func (x int) bool {
return x % 2 == 1
})
You’re still having to build an edifice (the sum() function) around the fact that Go’s arithmetic addition operator is special – it exists inside the language proper, because of its special syntax.
Everybody knows what macros are, right?
#define for_range(i,from,to) for (i = from; i <= to; i++)
The C pre-processor implements a rudimentary (and that word is doing a lot of work here) macro system that supports inline replacement and some token pasting.
You see, in C, when you use a “constant” like AF_INET, the C pre-processor replaces all occurrences of the token “AF_INET” with the numeric value that the constant represents. If you dump the symbols in a compiled C program, you’ll see function names and global variables, but you’ll not see a single, solitary #define’d constant. They evaporate.
Because cpp is just doing search-and-replace against the C program’s source code, you can get pretty tricky with C “macros”. I once wrote a list implementation that relied heavily on pre-processing to introduce new syntax for list iteration:
void do_all_the_things(struct list *things) {
    char *thing;
    for_each(thing, things) {
        printf("A thing: %s\n", thing);
    }
}
Here’s the magic pre-processor define for for_each(x,y):
#define for_each(var,lst) \
    for ((var) = list_item(lst); \
         list_has_next(lst); \
         (lst) = list_next(lst), \
         (var) = list_item(lst))
This only works because the other list* functions are specifically tuned to this usage, and the C pre-processor drops our (legitimate) for loop preamble into the program, right where our (totally illegal syntax) for_each loop preamble was.
People say they don’t like macros, or that they are complicated, or too meta, but then they use them every day. Ctrl-C is a macro. It’s a thing you type on the keyboard, that doesn’t type the thing you typed on the keyboard into the input field. That’s a macro. It’s meta by definition.
Lisp is the only language I’m aware of that has real macros, because Lisp is the only language I’m aware of where the representation of the programs and the representation of the data those programs operate on is EXACTLY THE SAME.
Lisp data is often organized into lists of symbols.
Lisp code is always organized into lists of symbols.
That means that Lisp data, operated on by a Lisp program, can produce a second, different Lisp program.
This is the heart of what macros are: a way of programming the compiler.
Here’s a thing you can do in Lisp that you cannot do in any other language: add new (non-)syntax:
(defmacro backwards-and-2 (a b)
  `(and ,b ,a))
With that macro defined, whenever I write the code
(backwards-and-2
  (if-second-thing)
  (if-first-thing))
The macro expander will rewrite that, swapping the expressions and passing them both to the and form:
(and (if-first-thing)
     (if-second-thing))
The Rust community has taken to calling these “zero-cost abstractions”, because they evaporate before the code generation phase starts. As far as everyone else is concerned, we ALWAYS wrote the and forms the right way round.
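You can watch the rewrite happen for yourself with macroexpand-1:
* (macroexpand-1 '(backwards-and-2 (if-second-thing) (if-first-thing)))
(AND (IF-FIRST-THING) (IF-SECOND-THING))
T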
The real power in Lisp macros comes when you realize that the macro expander operates with the full power of Lisp itself behind it. That means we can write macros that do all sorts of deep magic on the expressions we wrap them around, before spitting out code that would be error-prone for us to write ourselves.
Consider:
(defun is-html? (tag)
  (or (eq tag 'a)       ; ... etc.
      (eq tag 'abbr)
      (eq tag 'address)))
Wouldn’t it be nicer to say what the HTML tags are, and mechanically synthesize the is-html? predicate?
(defmacro html-tags (form)
  ;; FORM is evaluated at macro-expansion time, so it can be a
  ;; literal list of tags or a call like (read-all "html.tags").
  (let ((eqs (mapcar #'(lambda (tag)
                         `(eq tag ',tag))
                     (eval form))))
    `(defun is-html? (tag)
       (or ,@eqs))))
Now, if we call (html-tags ...) with all of the known HTML tags, we get (is-html? ...) for free, several times over!
(html-tags (list
'a 'abbr 'address)) ; ... etc.
It’s way more compact, but it is equivalent (at runtime) to the hand-coded version. The key difference is that I can add a new HTML element by adding its symbol to the argument list.
We don’t even need to specify the arguments literally! We could read the list of HTML tags from a local file if we wanted!
(defun read-all (path)
  (with-open-file (in path)
    (loop for form = (read in nil nil)
          while form collect form)))
(html-tags (read-all "html.tags"))
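(For reference, the html.tags file itself is nothing fancy: just bare, unquoted symbols, one after another, since read doesn't care about layout. Abbreviated here to the three tags from our example:)
a
abbr
address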
Now, if we discover a new HTML element, we pop its tag onto the end of the html.tags file, recompile our program and we magically have new code!
Isn’t lack of syntax grand?
A while back, I stumbled across an intriguing DIY community: Book Scanners. They have lots of books. I have lots of books. They want to digitally archive their books. I …
I also want to digitally archive my books, even if I didn’t know it before a few months ago.
Fast-forward to last weekend, when I got my workshop all in order:
You can actually see my first attempt at a book scanning rig, over there on the left. It’s little more than a cardboard box, cut diagonally and opened up, with two cheap (~$60) digital cameras on even cheaper (~$15) mounting arms. Here’s a closer view:
The concept is pretty straightforward: the book lies open on the cardboard box, at (roughly) a 100° angle, and each camera is pointed at the opposite page. To help the pages lie flat, we press two pieces of glass / lexan against them—I’m using the plates out of some 11×14″ picture frames I picked up on clearance at a craft hobby store. The cameras can see through the glass / lexan, and if we angle the light juuuust right, there’s no glare.
While you can run the cameras on battery, and regularly re-charge them, I opted to go for the hard-wired approach, with two of these:
These AC adapters plug into the wall and provide a “blank” battery cartridge that delivers electrons on the same contacts as a real battery. The camera has no idea we’ve tricked it into being on all the time!
For control, I’m using a combination of CHDK — the Canon Hack Development Kit, and a Raspberry Pi. If I’d had my druthers, I’d have used a Pi model with enough onboard USB ports (i.e. 2 or more), but all I had on hand was a Model A+ v1.1:
The Pi is hooked up to an external USB hub to get to port capacity. Here’s a rough block diagram:
With the hardware wired up and ready to go, it’s time to talk about the software bits that will make the whole thing scan. The trick with book scanning is to keep the “fiddling with the rig” to a minimum. For that, I’m using PTP – Picture Transfer Protocol, a means of remotely controlling a modern (-ish) digital camera over a USB link. The excellent (and aptly named!) software package chdkptp makes it easier to interact with PTP-enabled CHDK platforms.
For this to work, the cameras need to be running version 1.2 of the CHDK firmware. CHDK is a fascinating project in its own right — you burn an image into an SD card, lock it, load it, and boot up the camera. The camera notices that the SD card is locked, and boots off of the card, instead of its onboard ROM. It’s kind of like jailbreaking an iPhone, without all the stress.
The version of CHDK that I’m loading has PTP support built right in, so it can immediately take commands over the connected USB bus. Yes, that’s a bit redundant (universal serial bus bus), but what was I supposed to call it? the US bus?
The chdkptp package, on the other hand, provides the client driver for the PTP conversation. With it, we can issue commands across the USB link (yeah, that works better) and make the cameras do our bidding. Here’s roughly how it works:
$ sudo ./chdkptp-r964/chdkptp.sh -r -elist
-1:Canon PowerShot ELPH 160 b=001 d=022 v=0x4a9 p=0x32aa s=...
-2:Canon PowerShot ELPH 160 b=001 d=021 v=0x4a9 p=0x32aa s=...
That, however, is way too much to type, and my fingers are itching to turn pages, not mash keys. Here’s a wrapper script:
$ cat ~/bin/ptp
#!/bin/sh
set -e
exec sudo $HOME/chdkptp-r964/chdkptp.sh -r -e"$@"
$ ptp list
-1:Canon PowerShot ELPH 160 b=001 d=022 v=0x4a9 p=0x32aa s=...
-2:Canon PowerShot ELPH 160 b=001 d=021 v=0x4a9 p=0x32aa s=...
Much better.
(By the way, those USB device numbers (d=021, d=022, etc.) will change every time the camera power cycles; as we build out the rest of the automation / control software, we’ll definitely need to keep that in mind.)
PTP defines an operation called remote shoot, which will let us take a picture without physically touching the camera body. That’s key, because those stand arms aren’t the most rock solid things in the world.
$ cat shoot
#!/usr/bin/env ptpx
connect -b=001 -d=021
rs test-image
I should mention that the chdkptp binary is a bit awkward to use at times, and I have written this ptpx wrapper to emulate the behavior of more UNIX-y shell / command interpreters like Bourne Shell:
$ cat ~/bin/ptpx
#!/bin/sh
if [ "x$1" = "x" ]; then
  exec sudo $HOME/chdkptp-r964/chdkptp.sh -r -i
else
  exec sudo $HOME/chdkptp-r964/chdkptp.sh -r -e"source $1"
fi
Without arguments, ptpx starts up an interactive (-i) shell. With arguments, it sources the first argument (ignoring the rest), in a sense executing its argument. Just like a real shell!
Time to shoot some photos!
$ ./shoot
connected: Canon PowerShot ELPH 160, max packet size 512
ERROR: not in rec mode
ERROR: error on line 3
Most cameras operate in one of two modes: playback and record. In playback mode, you can browse through the photos stored on the camera, muck with settings, change the time, and more. In record mode, the backpanel LCD turns into a pixelated viewfinder, and the camera can actually produce output image files from what’s on the lens.
We want record mode:
$ cat shoot
#!/usr/bin/env ptpx
connect -b=001 -d=021
rec
rs test-image
$ ./shoot
connected: Canon PowerShot ELPH 160, max packet size 512
(flash goes off, camera chimes, and boom! photograph)
$ ls -lah test-image.jpg
-rw-r--r-- 1 root root 4.2M Apr 21 02:08 test-image.jpg
Now we have a working point-and-shoot, which we can access over SSH. Let’s double the fun!
$ cat capture
#!/bin/sh
set -eu
# USAGE: capture NUMBER CAM1 CAM2
n=$1 ; shift
for cam in "$@"; do
cat <<EOF | ptpi &
connect -b=001 -d=$cam
rs img-$n-from-$cam
EOF
done
wait
This script is a bit more involved, but it builds on the same concepts. The ptpi script, however, is new:
#!/bin/sh
f=$(mktemp ptpXXXXXXXXX)
trap "rm -f ${f}" QUIT TERM INT EXIT
cat >$f
ptpx $f
exit $?
It takes standard input (the “i” stands for “input”), stuffs it into a file on disk, and then calls the ptpx script on it. At the end of the execution, come hell or high water, the temporary file is removed. Back to capture!
$ capture 1 21 22
connected: Canon PowerShot ELPH 160, max packet size 512
connected: Canon PowerShot ELPH 160, max packet size 512
Now if we look in the current working directory, we can plainly see our two images:
$ ls -lah
-rw-r--r-- 1 root root 4267460 Apr 21 02:11 img-1-from-21.jpg
-rw-r--r-- 1 root root 5745788 Apr 21 02:11 img-1-from-22.jpg
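To scan an actual book, I wrap capture in a dead-simple driver loop. This is my own glue, nothing chdkptp-specific, and the device numbers (21, 22) come from ptp list:
$ cat scanbook
#!/bin/sh
# shoot the two facing pages, bump the counter, wait for a page turn
n=1
while printf 'turn the page, then press Enter... ' && read _; do
  capture "$n" 21 22
  n=$((n+1))
done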
Eventually, we’ll want to run these through some computer vision code to find page boundaries, correct for skew and warp, re-order the pages, and pop out a PDF. I’m still working on all of that; so check back soon!
Over the years, I've collected a small bag of useful, if somewhat silly, tricks for using Docker (and what I'll call low effort containerization) to make my day-to-day computing life easier. Here are some of my favorites!
If you're a purist for the tool names you've been given by the project developers, it's probably best for you to just skip this section outright.
I hate typing. So much so that I have tons of shell aliases, git aliases, and tiny utilities with small names littered about my systems — and I can't live without them.
Specifically, I can't live without dr:
$ cat ~/bin/dr
#!/bin/sh
docker run -it --rm "$@"
Seriously, I use this command at least once a day. Anywhere you see a code listing with a docker run in it, you can bet that I'm actually shortening that to just plain dr.
The --rm is important; it tells the Docker daemon to clean up the container filesystem and configuration when the contained process exits. If you bounce through Docker containers as much as I'm going to propose you start doing, you'll otherwise soon end up with tons of dead containers eating up space on your hard disk.
The -it is there because, primarily, when I'm typing in Docker-y commands, I'm at a terminal (-t), and I really want to be able to send data through standard input (-i).
cat
This one seems almost too basic to mention, but I see far too many professionals overlooking the sheer power and awesomeness of cat combined with docker run.
I spin up a lot of HTTP APIs, and I almost always reach for nginx when I need a general purpose, lightweight reverse proxy to front them. I also firmly believe in explicitly stating configuration; I am not comfortable "accepting the defaults" of things like the standard nginx image.
No biggie! The containerization platforms I play with (Docker / Kubernetes) make it trivial to mount in your own files, shadowing the configuration that comes with any particular image. Usually, though, I want to start from the default configuration and either accept it explicitly, or tweak it.
Which brings me back to cat: ever wonder what the default nginx image's root configuration file looks like? Wonder no more!
$ docker run --rm nginx cat /etc/nginx/nginx.conf
user  nginx;
worker_processes  1;

error_log  /var/log/nginx/error.log warn;
pid        /var/run/nginx.pid;

events {
    worker_connections  1024;
}

http {
    include       /etc/nginx/mime.types;
    default_type  application/octet-stream;

    log_format  main  '$remote_addr - $remote_user [$time_local] "$request" '
                      '$status $body_bytes_sent "$http_referer" '
                      '"$http_user_agent" "$http_x_forwarded_for"';

    access_log  /var/log/nginx/access.log  main;

    sendfile        on;
    #tcp_nopush     on;

    keepalive_timeout  65;

    #gzip  on;

    include /etc/nginx/conf.d/*.conf;
}
A shell redirect later, and I'm munging that default configuration in vim and getting ready to pop out yet another proxied HTTP(S) REST API, containerized.
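Concretely, that workflow is just the following (ports and paths to taste):
$ docker run --rm nginx cat /etc/nginx/nginx.conf > nginx.conf
$ vim nginx.conf
$ docker run --rm -p 8080:80 -v $PWD/nginx.conf:/etc/nginx/nginx.conf:ro nginx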
I do a lot of development on my local machine. All of the projects that I'm currently working on have at least one semi-complicated auxiliary system with which they interact: PostgreSQL, MariaDB, Mongo, Redis, etc.
In the olden days before containers, I used to run actual hardware to host these types of systems, and I would then connect my dev applications up to those spinning systems for testing, demoing, etc.
Now I use Docker Compose.
For each project I'm working on, I maintain at least one docker-compose.yml that spins up all of my data service dependencies, with lots and lots of port-forwarding. Here's an example, for https://vaultofcardboard.com:
version: '3'
services:
  pg:
    image: postgres:12
    ports: ["$VCB_PG_PORT:5432"]
    environment:
      POSTGRES_PASSWORD: foo
  redis:
    image: redis
    ports: ["$VCB_REDIS_PORT:6379"]
This sets up my two data systems, Redis and PostgreSQL. They forward ports, so I can access them on loopback. Precisely which ports are forwarded to the canonical, in-container ports is deferred until later, allowing me to run lots and lots of different projects at once, without (a) port collisions or (b) hard-to-remember automatically allocated ports.
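The port numbers themselves live in my shell environment; a per-project .env file works nicely, since docker-compose reads it automatically. The values here are arbitrary:
$ cat .env
VCB_PG_PORT=5501
VCB_REDIS_PORT=5502
$ docker-compose up -d
$ psql -h 127.0.0.1 -p 5501 -U postgres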
You'll also notice that this compose recipe lacks volumes of any sort. That's on purpose, and brings me to the next trick...
Most of my local work is cutting edge, experimental, push-the-envelope sort of work. Can we do X? What's the performance differential if we go with Y instead? How do we migrate data between these two versions of Z?
For most of that work, I don't care about persistence. In fact, if I've got schema setup scripts and data import / restore logic, I really don't want persistence. It just opens me up to a bunch of headache related to leftover data.
Instead, I spin containers with ZERO persistent volume mounts. When you run the postgres image, you're supposed to mount something at /var/lib/postgresql/data. Not me.
You see, if you don't bother mounting volumes, then a container recycle is enough to get you a brand new database, rip-roarin' and ready to go. No more mucking about with RDBMS transactions, cleanup scripts, or the like. With ephemeral containers, just bounce or recreate them with docker-compose.
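In practice, a fresh database is a one-liner (rm -sf stops and removes the container without prompting):
$ docker-compose rm -sf pg && docker-compose up -d pg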
These days, my data systems live for, on average, about an hour. Tops.
I've been very happy with Docker. It hasn't replaced my operational workload (still primarily on Linux VMs / Kubernetes for that), but it has dramatically changed the way I test, develop, and evaluate software solutions.
When I was looking to jump from my Apple lifestyle to a more bare-bones, Linux-centric world, I started casting about for replacements for my mainstays of macOS / iOS computing. The first thing on the replacement short-list: 1password.
Thankfully, there is already a decent password manager for UNIX: pass. It uses GPG (via gpgme) and a handful of POSIX and near-POSIX utilities to manage an encrypted hierarchy of secrets, similar in spirit to Vault, LastPass, and 1password.
It even has clipboard integration!
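If you haven't seen pass before, the day-to-day surface area is tiny; from memory, it goes something like this:
$ pass insert web/github.com
Enter password for web/github.com:
$ pass -c web/github.com
Copied web/github.com to clipboard. Will clear in 45 seconds.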
So, I dockerized it. And quickly ran afoul of some peculiarities in X11, with respect to clipboard functionality.
My mental model of the ubiquitous clipboard goes a little something like this:
Indeed, I have always thought of the "clipboard" as an actual thing, that is separate from the clients. And in macOS / Windows, that seems to be precisely the case.
Not so, with X11.
X11 distributes the job of clipboarding to all involved applications. What I mentally call "the clipboard" is nothing more than a convention of naming and differentiating what X11 calls selections.
Here's what happens, under X11, when you copy something:
That is, the copying application never sends the data to the X11 server for safe-keeping. Instead, it informs the X11 server that it now owns the clipboard, and that anyone who wants to paste should talk to the application (semi-)directly.
Here's how paste works:
(I've elided some of the details; mainly that the pasting application and the copying application never communicate directly, but rather coordinate through asynchronous messages sent to the X11 server. The upshot is that we end up "communicating" peer to peer to get at the clipboard data.)
So, what happens if the copying application goes away between the "copy" and the "paste"?
The clipboard disappears.
This completely shatters my mental model of the clipboard as a neutral, independent buffer that gets written to by copy operations, and read from by paste operations.
To the problem at hand: the pass utility lets you copy secret data to "the clipboard" (skeptical air quotes), and expires it after 45 seconds. Here's what really happens:
1. pass retrieves the current contents of the clipboard (as a UTF-8 string) and holds onto it.
2. pass decrypts the secret and marks that data as the X11 clipboard selection.
3. pass forks off a subshell in the background that sleeps for 45 seconds and then re-constitutes the original clipboard contents by re-marking the old data as the current selection.
This is all done using xclip, which works around X11's reliance on the continued existence of the clipboard "owner" by forking off a long-lived background process that sticks around to answer paste requests for as long as it owns the selection.
Awesome. Works great in normal mode, where pass is just executed under a long-running PID 1 (which will tend to the orphaned xclip process until the machine shuts down).
Works less than great when I try to dockerize pass itself.
In that case, Docker sets up a new cgroup environment, with its own PID namespace, with pass playing the role of PID 1. When xclip forks off to be "long-lived" (again, skeptical air quotes), we re-inherit it as a child process, and then summarily exit. At this point, Docker kills off all of the remaining child processes, and our clipboard vanishes into thin air.
That's why I wrote xclipd.
It's a small daemon that runs on the host and doggedly watches the X11 CLIPBOARD selection, immediately requesting copied data and then asserting control of the clipboard. This reifies clipboard management, bringing it in line with my neutral-third-party mental model of clipboards.
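You can approximate the idea in a few lines of shell around xclip itself. This is a sketch of the behavior, not the actual xclipd source:
#!/bin/sh
# watch the CLIPBOARD selection; whenever it changes, take a copy
# and immediately re-assert ownership from this (long-lived) process.
last=''
while sleep 0.2; do
  cur=$(xclip -o -selection clipboard 2>/dev/null) || continue
  if [ "$cur" != "$last" ]; then
    last=$cur
    printf '%s' "$cur" | xclip -i -selection clipboard
  fi
done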
If you really want to dig into how X11's clipboard management works, and you don't mind a little bit of C, check out https://www.uninformativ.de/blog/postings/2017-04-02/0/POSTING-en.html
I've been spending a lot of time working on the next iteration of the proof-of-concept Tweed data services broker. In the version I demonstrated back in December, all the heavy lifting was done by little bits of code + configuration called stencils. All the logic to deploy a Redis instance, for example, lived in a directory called redis/standalone.
This works great in theory -- whenever the broker has to do something to a Redis (standalone) instance, it can shell out and exec a small script which finds the stencil root and then executes other things, like kubectl apply or bosh deploy.
One of the biggest problems with this approach is all those pesky dependencies that exist outside of the stencil root: things like spruce, jq, and even kubectl and bosh. What happens if one stencil requires a newer version of jq, and another requires an older version?
This type of dependency management has already been solved, and quite effectively too: OCI (Docker) images.
At its core, an OCI image is nothing more than a supporting root filesystem that contains all of the binaries, libraries, and base operational configuration files that a given process needs to properly execute.
If we re-design Tweed so that each stencil is an OCI image, we get a whole bunch of benefits:
- docker build, dive, and more are all available to us intrinsically.
- docker run with the appropriate mounts, standard input, and environment variables set is all you need to exercise all the parts of a stencil. This can be formalized, tossed into a CI/CD solution (like GitHub Actions), and become a regular part of your test/build/release workflow.
There's only one downside to using OCI images for parts of the application logic: carrying around a Docker daemon with you, everywhere Tweed goes.
The Tweed demo can be deployed to a Kubernetes cluster, today. Given a Service Account with appropriate cluster rights, it can then effortlessly turn around and start deploying new service instances to that cluster. We would like very much to retain that power / flexibility.
What we need is a way to run OCI images without a Docker daemon.
What we need is runc.
runc Installed
Note: runc deals with Linux cgroups, and is at its core a Linux containerization technology. As such, you'll only be able to use runc from a proper Linux box. I haven't yet gotten around to testing this stuff on Windows Subsystem for Linux (WSL). For best results, use something like Vagrant if you don't already natively run on Linux.
The easiest way to install runc is to grab the latest static binary from their GitHub Releases Page, then chmod it to be executable (0755 works just fine) and pop it in a directory in your $PATH:
$ curl -sLo ~/bin/runc https://github.com/opencontainers/runc/releases/download/v1.0.0-rc9/runc.amd64
$ chmod 0755 ~/bin/runc
$ runc -v
runc version 1.0.0-rc9
spec: 1.0.1-dev
runc for a Whirl
Let's start with a simple container: alpine:3. With Docker, we can do things like this:
$ docker run alpine:3 ps -ef
PID USER TIME COMMAND
1 root 0:00 ps -ef
We want to figure out how to do that in runc, and dispense with the orchestrating Docker daemon.
While runc operates on OCI images, it doesn't work with Docker registries. That means we can't docker pull, and we need to have the OCI image bits locally accessible to us, for runc to be able to spin up the container.
This is where you'll start to notice differences between Docker-isms and container-isms. docker pull is a Docker-ism. docker exec is a container-ism.
We could craft our own image bits (called a rootfs, in runc parlance), but that's tedious and error-prone. Besides, we have a perfectly good image sitting inside of our Docker daemon; why not just extract the alpine:3 image and use it?
$ docker run -d --name alpine-for-runc alpine:3
«some hard-to-remember container uuid»
$ mkdir rootfs
$ docker export alpine-for-runc | tar -x -C ./rootfs
$ ls -l ./rootfs
total 0
drwxr-xr-x 84 jhunt staff 2688 Dec 24 10:04 bin
drwxr-xr-x 5 jhunt staff 160 Feb 25 08:26 dev
drwxr-xr-x 36 jhunt staff 1152 Feb 25 08:26 etc
... etc. ...
Docker doesn't let us extract image filesystems directly; it only allows us to export a container's filesystem. This turns out to be less of a blocker than it initially appears -- all we do is docker run the image to get a container, and then export that. When we do so, we can pipe it directly into tar. The -C flag will change directory into the new rootfs/ directory before it starts extracting stuff. We'll give this directory to runc when we start spooling up containers.
Now that we have the salient bits of the OCI image, it's time to start setting up execution parameters. runc does this by way of a JSON configuration file, imaginatively named config.json. You can write these by hand, or you can get one out of runc spec and modify it to your liking.
We'll do the latter:
$ runc spec
$ ls -l config.json
$ sed -ie 's/"terminal": true/"terminal": false/' config.json
$ sed -ie 's/"sh"/"ps","-ef"/' config.json
Those last two sed lines are a bit cryptic, but they boil down to a few simple, heuristic changes we need to make to the generated spec. By default, runc spec creates a configuration that requires a terminal. We won't need an interactive pseudo-TTY device, so we can (and do) turn that off. We also want to swap out the default command sh for our ps -ef.
To run this, we'll sudo as root and give it a whirl:
$ sudo runc run foo-$(date +%Y%m%d%H%M%S)
PID USER TIME COMMAND
1 root 0:00 ps -ef
Yay! It worked!
It's important to note here that the command to run inside of the container is specified in the config.json file, and nowhere else. We don't specify it on the command-line for runc, specifically. This means that whenever we want to change commands, we need a different config.json. Tools like Spruce and jq can be extra handy here.
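For instance, here's a sketch with jq; the .process.args and .process.terminal paths come straight from the OCI runtime spec:
$ jq '.process.args = ["ps","-ef"] | .process.terminal = false' \
    config.json > config.json.new
$ mv config.json.new config.json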
Now we've got OCI images executing as containers, without needing a Docker daemon. Go ahead and shut down your Docker daemon and try the above command.
The last thing we want to do is drop that sudo requirement. Ideally, we'd like to be able to run our containers as ourselves (or as whatever account Tweed is going to execute as) without relying on privilege or specific Linux capabilities.
runc has a concept called Rootless Containers, whereby regular users like you and I are allowed to set up our own cgroup namespaces. This relies on some fiddly in-kernel configuration bits, but most modern Linux distributions should have the CONFIG_USER_NS bit twiddled to the "yes" position. To find out (at least on my Ubuntu system):
$ grep CONFIG_USER_NS /boot/config-$(uname -r)
CONFIG_USER_NS=y
The difference between rootless containers and other (rootfull?) containers lies in their config.json specs. To generate a rootless spec, use the --rootless flag:
$ runc spec --rootless
Note: you'll have to remove or rename your existing config.json spec; runc will (thankfully) refuse to overwrite it.
If you're curious, here's the semantic difference between the rootless spec and the one we started with (my UID/GID is 1000/1000):
$ spruce diff root-config.json config.json
linux
- one map entry removed: + two map entries added:
resources: gidMappings:
│ devices: - containerID: 0
│ - access: rwm │ size: 1
│ │ allow: false │ hostID: 1000
uidMappings:
- containerID: 0
│ size: 1
│ hostID: 1000
linux.namespaces
- one list entry removed: + one list entry added:
- type: network - type: user
mounts
- one list entry removed:
- type: cgroup
│ source: cgroup
│ destination: /sys/fs/cgroup
│ options:
│ - nosuid
│ - noexec
│ - nodev
│ - relatime
│ - ro
mounts./dev/pts.options
- one list entry removed:
- gid=5
mounts./sys.options
+ one list entry added:
- rbind
mounts./sys.source
± value change
- sysfs
+ /sys
mounts./sys.type
± value change
- sysfs
+ none
With this new spec in place (don't forget to run the same sed commands we had earlier for dropping the terminal requirement and changing out the command to run), we can now run our Docker-less container as ourselves:
$ runc run foo
PID USER TIME COMMAND
1 root 0:00 ps -ef
Success!
There are several different ways you can go with this new technique.
For Tweed, I'll be investigating what it takes to modify the current lifecycle shell scripts such that they execute into a specially-crafted OCI image to do their work. This will most likely result in a small-ish command-line tool that shoulders some of the burden of configuring container specs away from the caller.
For other projects, both current and future, you can use this technique for deferring some part of the computation set before you to external sources. For example, an image management API might have pluggable backends, controllable by the operators, that are implemented in terms of conversion processes inside of OCI images. Want to support WebP? Just code up a new Docker container, export it, and drop it on the API implementation. You can even use registries like Docker Hub as a distribution mechanism -- push to Docker Hub, pull from Docker Hub on a re-packaging box, export, then upload!
Having the ability to just execute OCI images at-will opens up a whole new world of cloud-native and container-native application design. I hope you find a use for it, and hit me up on Twitter if you do!
Happy Hacking!
This morning, I was trying to wrangle our CI/CD pipeline for the Containers BOSH Release so that I could cut a 1.1.0 release and generally forget about the process of integration testing.
Our stock pipeline architecture for BOSH releases runs a deployment test by taking a manifest — either the example manifest or a CI-specific manifest — and deploying it to a BOSH director. If the deployment takes, the BOSH release is considered fit for a release.
To conserve space, we usually target a BOSH director running the Warden CPI, which just spins up Linux containers for each BOSH instance group. For most things, this is sufficient; BOSH releases rarely care about the infrastructure they are deployed on, after all.
But the Containers BOSH Release is a bit different. It runs Docker, which needs a whole slew of container-y things like cgroups to function. When I tried deploying it onto a containerized CPI, it blew up, as things on the cutting edge are wont to do.
Task 352 | 18:16:01 | Updating instance docker:
docker/b3ab4f49-2f05-4865-b46c-eb248ebded5b (0) (canary) (00:02:26)
L Error:
'docker/b3ab4f49-2f05-4865-b46c-eb248ebded5b (0)' is not running
after update. Review logs for failed jobs: docker, compose
Looking into the logs, I found that, at least under Warden, you don't have access to mount new things (including cgroup hierarchies) under /sys/fs/cgroup. It's just plain not allowed.
A simple workaround is to move the mountpoint elsewhere. Docker still finds the cgroups wherever they happen to be mounted. However, since this is a workaround specifically for Warden, I was hoping to find a way to limit the behavior to only occur there. Other IaaS deployments (vSphere, AWS, GCP, etc.) are perfectly happy to mount things where they belong.
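The workaround itself amounts to a few lines of pre-start shell. A sketch, with a mount location I picked arbitrarily:
cgroups=/var/vcap/data/cgroups   # anywhere writable will do
mkdir -p "$cgroups"
mountpoint -q "$cgroups" || mount -t tmpfs cgroups "$cgroups"
for sys in cpu cpuacct cpuset devices freezer memory; do
  mkdir -p "$cgroups/$sys"
  mountpoint -q "$cgroups/$sys" || \
    mount -t cgroup -o "$sys" cgroup "$cgroups/$sys"
done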
Luckily, I found just what I was looking for in /var/vcap/bosh/etc/infrastructure. On Warden, this contains the text warden, making it dead easy to recognize a Warden CPI from a mile away:
if grep -q warden /var/vcap/bosh/etc/infrastructure; then
  echo 'Hello, Warden!'
  # ... do warden-y things ...
else
  # ...
fi
In the past, I've resorted to such tricks as the im_in_warden_dont_do_X property:
properties:
  im_in_warden_dont_do_nfs: yes
or the just-ignore-the-failure trick, which is less than ideal.
Now, armed with /var/vcap/bosh/etc/infrastructure, I can make IaaS-specific decisions to tailor my BOSH releases to the situation at hand.
My my, that is a provocative title for this blog, isn't it?
I've just finished up work on my latest experiment with BOSH: the Containers BOSH Release. As more and more of my life slips into Docker containers (on its way to Kubernetes), I find myself wanting to spend less and less time re-packaging stuff for BOSH. Lots of software already exists in Docker form, since that's a pre-requisite for deploying things on Helm (and K8s), which is all the rage.
The other day, I said to myself, “James, what if we just ran the Docker images on top of BOSH?” It seemed crazy at first, but the more I thought it through (and the more I played with the implementation) the more sane and normal it became.
The premise of the Containers BOSH Release is simple: start with a docker-compose recipe, and run it on one or more BOSH VMs.
I like to start with easy examples, so we're going to spin up a single-node Vault instance to prove that this crazy Containers thing works.
We'll start with the simplest of manifest stubs:
---
name: vault

stemcells:
  - alias: default
    os: ubuntu-xenial
    version: latest

releases:
  - name: containers
    version: latest

update:
  canaries: 1
  max_in_flight: 1
  serial: true
  canary_watch_time: 1000-120000
  update_watch_time: 1000-120000

instance_groups:
  - name: docker
    instances: 1
    azs: [z1]
    vm_type: default
    stemcell: default
    networks: [{name: default}]
    jobs:
      - name: docker
        release: containers
        properties:
          recipe:
            # .... START HERE ...
Under the recipe property, we'll just insert a bit of docker-compose:
# ... continuing on ...
recipe:
  version: '3'
  services:
    vault:
      image: vault
      ports: ['8200:8200']
      environment:
        VAULT_API_ADDR: http://127.0.0.1:8200
        VAULT_LOCAL_CONFIG: >-
          {
            "disable_mlock": 1,
            "backend": {
              "file": {
                "path": "/vault/file"
              }
            },
            "listener": {
              "tcp": {
                "address": "0.0.0.0:8200",
                "tls_disable": 1
              }
            },
            "default_lease_ttl": "168h",
            "max_lease_ttl": "720h"
          }
      cap_add: [IPC_LOCK]
      command: [vault, server, -config, /vault/config/local.json]
That's it. Toss that at your favorite BOSH director, and when it's all deployed, you should be able to access the Vault on port 8200.
(If you want the full manifest, download it from here)
$ bosh deploy -n vault.yml
$ bosh vms
Instance Process State AZ IPs VM CID VM Type Active
docker/b046a21f running z1 10.128.16.143 vm-a47c3bd5 default true
Let's target that with safe:
$ safe target http://10.128.16.143:8200 dockerized
Now targeting dockerized at http://10.128.16.143:8200
Since this is a new Vault, we're going to need to initialize it:
$ safe init
Unseal Key #1: df16cd701bde233c768cda6c20e214e640bc43cd1b81d977a983d5590dd2659a03
Unseal Key #2: fe8ea8ab8a22ef931d5338dd6f4f2f6932ffa22f6caa26fd30cd57e11ffe137260
Unseal Key #3: c6a872983488ae92e30bb0f74a1a2795978e247c502b4684e70189fe0ba2ad90c6
Unseal Key #4: 6c72063fcf8a9b82a7c72fa286d14f84c9e46c7e30d21d9040ebbfab7725740170
Unseal Key #5: 836f40ec1f7c34460fc86b7547caccc7ffa6b680b9a69a0205b1cddebcb33d2530
Initial Root Token: s.BoFveccftUE9y9j1p4WfyPpO
Vault initialized with 5 keys and a key threshold of 3. Please
securely distribute the above keys. When the Vault is re-sealed,
restarted, or stopped, you must provide at least 3 of these keys
to unseal it again.
Vault does not store the master key. Without at least 3 keys,
your Vault will remain permanently sealed.
safe has unsealed the Vault for you, and written a test value
at secret/handshake.
You have been automatically authenticated to the Vault with the
initial root token. Be safe out there!
There you go, a Vault! And we didn't have to write our own BOSH release.
Let's try a bit more complicated of an example, shall we?
The SHIELD docker-compose recipe consists of five different containers that work together to provide all the moving parts necessary to evaluate SHIELD's effectiveness as a data protection solution.
Despite all of this complexity, deploying to BOSH via Containers is just as straightforward — just drop the docker-compose.yml file contents (properly indented, of course) under the recipe: property of the docker job.
If you want, you can read the entire BOSH manifest.
Once that is deployed, I used bosh vms to get the IP address of the deployed VM, and then targeted that IP (on port 9009) with the SHIELD CLI:
$ bosh vms
Instance Process State AZ IPs VM CID VM Type Active
docker/d9b2c11d running z1 10.128.16.144 vm-add4d02e default true
$ shield api http://10.128.16.144:9009 docker-shield
docker-shield (http://10.128.16.144:9009) OK
SHIELD DOCKER
$ shield -c docker-shield login
SHIELD Username: admin
SHIELD Password:
logged in successfully
$ shield -c docker-shield status
SHIELD DOCKER v8.2.1
API Version 2
If you want, you can head on over to the SHIELD web UI (also on port 9009).
Anchore is a security scanning solution for Docker images. You give it an image URL and it will pull that image down, unpack it, and scan it for known CVEs and other vulnerabilities. As more of my F/OSS projects move to delivering OCI images as release assets, I wanted a solution that could scan (and re-scan) those images quickly and painlessly.
(by the way, according to their Slack org, Anchore rhymes with encore).
It's a neat system, and its canonical deployment is via a docker-compose recipe, making it a perfect fit for this new mental model of deploying to BOSH.
To deploy, I started with the upstream docker-compose recipe, and then tweaked it slightly (mostly by renaming the Docker containers). The final manifest is here. Go ahead and deploy it; we're going to segue briefly into some theory, but we'll come back to Anchore soon enough.
The Containers BOSH release does the following for you:
- installs Docker and docker-compose
- writes out a docker-compose.yml file on the BOSH VM, based on what you put in the recipe property.
- runs docker-compose up from a script that monit (BOSH's supervisor) babysits for you.
All the stability of BOSH, all of the flexibility of Docker!
The first thing our docker-compose up is going to do is contact some registry somewhere and pull down the images it needs to run the project. Depending on your environment, this may either be acceptable (I hope you are pinning your image tags), or it may not. In "air gapped" environments, you cannot directly download anything from the public Internet and run it, for security reasons.
That pretty much rules out this new thing, right?
Wrong. I saw this edge case a mile away — I work with lots of environments that are either forced to use semi-broken HTTP(S) proxies, or are simply not allowed out to the Internet.
After the Docker daemon boots, but before we run docker-compose up, we scan the disk looking for other BOSH jobs that might be able to provide us with the raw material of OCI-compliant images: layered tarballs.
The code in question looks a little something like this:
for image in $(cat /var/vcap/jobs/*/docker-bosh-release-import/*.lst)
do
  docker load <$image
done
Any job that gets co-located on the instance group has the option of defining a list of paths to exported OCI image tarballs, that we will load into our local Docker daemon.
So yes, if you want to run in an air gapped environment, you do still have to write BOSH releases, but they are super simple and require almost no thought. I even wrote a proof-of-concept image release that packages up the Vault image we've been playing with.
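Feeding one of those releases is mostly a matter of docker save -- save, not export, because docker load wants the layered image format rather than a flattened filesystem. The version tag here is illustrative:
$ docker pull vault:1.3.2
$ docker save vault:1.3.2 > vault-image.tar
Package that tarball into the release, list its path in the job's .lst file, and the snippet above takes care of the rest.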
docker exec?
One of my favorite features of Kubernetes is kubectl exec; being able to just bounce into a container and poke around to verify things is amazingly powerful.
You can imitate this powerful feature by binding the Docker daemon to a TCP port. Normally, Docker just binds a UNIX domain socket (like a port, but it's a file). This provides some level of protection for Docker, since you have to be on the Docker host to see the socket file, and you have to have the appropriate group memberships to be allowed to write to it.
If you set the bind: property of the docker job to a TCP port number, the Docker daemon will also listen on that port, across all interfaces, for inbound control messages. You can combine this with the -H ... flag to the docker CLI utility, or, better yet, the $DOCKER_HOST environment variable.
If you look closely at the Anchore BOSH manifest, you'll notice that it binds port 5001, letting us do this:
$ bosh -d anchore vms
Instance Process State AZ IPs VM CID VM Type Active
docker/2e016a81 running z1 10.128.16.142 vm-ffbfedd1 default true
$ docker -H 10.128.16.142:5001 ps
CONTAINER ID IMAGE STATUS PORTS NAMES
47fb54c4b539 (anchore) Up (healthy) 8228/tcp running_simpleq_1
16addb64383e (anchore) Up (healthy) 8228/tcp running_policy-engine_1
c56b06c1ce10 (anchore) Up (healthy) *:8228->8228/tcp running_api_1
f0ab474b62fc (anchore) Up (healthy) 8228/tcp running_analyzer_1
3c46e9c70e3c (anchore) Up (healthy) 8228/tcp running_catalog_1
c96f8a5b9596 (postgres) Up 5432/tcp running_db_1
Since we have complete access to Docker, let's try an exec:
$ export DOCKER_HOST=10.128.16.142:5001
$ docker exec -it running_api_1 /bin/bash
[anchore@c56b06c1ce10 anchore-engine]$
Success! From here, you can use the embedded anchore-cli to interact with the scanning solution:
[anchore@c56b06c1ce10 anchore-engine]$ anchore-cli system feeds list
Feed Group LastSync RecordCount
vulnerabilities alpine:3.3 2019-06-13T01:23:20.228251 457
vulnerabilities alpine:3.4 2019-06-13T01:23:32.916373 681
vulnerabilities alpine:3.5 2019-06-13T01:23:49.301987 875
vulnerabilities alpine:3.6 2019-06-13T01:24:10.832360 1051
vulnerabilities alpine:3.7 2019-06-13T01:24:38.684207 1125
vulnerabilities alpine:3.8 2019-06-13T01:25:08.305060 1220
vulnerabilities alpine:3.9 2019-06-13T01:25:39.714090 1284
vulnerabilities amzn:2 2019-06-13T01:26:02.169712 178
vulnerabilities centos:5 2019-06-13T01:27:20.146913 1323
vulnerabilities centos:6 2019-06-13T01:28:43.593786 1333
vulnerabilities centos:7 2019-06-13T01:29:58.219342 793
vulnerabilities debian:10 2019-06-13T01:37:15.136873 20352
vulnerabilities debian:7 2019-06-13T01:45:14.279082 20455
vulnerabilities debian:8 2019-06-13T01:53:57.173538 21775
vulnerabilities debian:9 2019-06-13T02:02:19.820286 20563
vulnerabilities debian:unstable 2019-06-13T02:10:43.437588 21245
vulnerabilities ol:5 2019-06-13T02:12:14.888089 1233
vulnerabilities ol:6 2019-06-13T02:14:05.563999 1417
vulnerabilities ol:7 2019-06-13T02:15:33.553661 915
vulnerabilities ubuntu:12.04 2019-06-13T02:20:54.345817 14948
vulnerabilities ubuntu:12.10 2019-06-13T02:22:54.519832 5652
vulnerabilities ubuntu:13.04 2019-06-13T02:24:15.630364 4127
vulnerabilities ubuntu:14.04 2019-06-13T02:30:58.521151 18693
vulnerabilities ubuntu:14.10 2019-06-13T02:32:41.325661 4456
vulnerabilities ubuntu:15.04 2019-06-13T02:34:46.806824 5789
vulnerabilities ubuntu:15.10 2019-06-13T02:37:12.849533 6513
vulnerabilities ubuntu:16.04 2019-06-13T02:43:25.005589 15795
vulnerabilities ubuntu:16.10 2019-06-13T02:46:17.313668 8647
vulnerabilities ubuntu:17.04 2019-06-13T02:49:16.106250 9157
vulnerabilities ubuntu:17.10 2019-06-13T02:51:51.897617 7935
vulnerabilities ubuntu:18.04 2019-06-13T02:55:06.965036 10047
vulnerabilities ubuntu:18.10 2019-06-13T02:57:35.083799 8134
vulnerabilities ubuntu:19.04 2019-06-13T02:59:23.443062 6586
etc.
In a perfect world, we'd all use TLS everywhere, and everyone would have a verifiable chain of trust back to a CA root authority.
Ha!
Sometimes, you can't help it that your Docker Registry is protected by a self-signed certificate, or one whose CA isn't in the system roots. Occasionally, you have to go without transport security altogether and run on plain old HTTP.
For that, Docker supports the concept of an insecure registry, and the Containers BOSH release lets you supply a list of those ip:port endpoints that just won't pass muster on X.509 validation:
jobs:
- name: docker
  release: containers
  properties:
    insecure-registries:
      - docker.corp.int:3000
That way, if you have to pull any images from those registries as part of your compose recipe, you're covered.
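Nothing else changes in the compose recipe itself; you just reference images by their full registry address as usual (this image name is made up):

services:
  thing:
    image: docker.corp.int:3000/ops/thing:1.2.3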
That one we're still working on, but given that we can now trivially spin our own Docker Registries, using this new BOSH release and upstream Docker images, I expect that will get fixed soon enough. You might even be the one to PR that!
I hope my experiments here have piqued your interest. Go out, snag a copy of the latest release, and deploy your favorite Dockerized workload, now with the power of BOSH!
Oh, and if you run into any problems along the way, or find a way to improve the BOSH release, we hope to hear from you.
Happy Hacking!
I spend a surprising amount of time pawing through process tables on various UNIX boxes. The biggest problem I run into is that the grep process itself generally shows up in the process table while the search is running, so it gets picked up. The downside there is that grep still finds stuff, even if the process doesn't exist!
→ ps -ef | grep 12345
501 71075 70391 0 10:49AM ttys015 0:00.00 grep 12345
See? There is no process 12345, but there is a grep for it.
The naive solution (which I used for years) is to grep out the grep:
→ ps -ef | grep 12345 | grep -v grep
It works (unless process 12345 is itself a grep), but it feels bad.
A while back, I picked up a technique that uses the power of regular expressions:
→ ps -ef | grep 1[2]345
This works because [2] in grep's implementation of regular expressions is a character class match; it matches any character in the set '2' (which is just a literal '2'). However, in the process table entry for the searching grep, the literal command line is 1[2]345, which doesn't match.
I call this "The [P]ID Trick" because I usually employ it to find processes by their process ID (almost always from a pidfile somewhere).
You can put the brackets anywhere. You don't even have to be looking for a process ID! This finds all SSH processes (client and server):
→ ps -ef | grep s[s]h
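If you use the trick a lot, it's easy to wrap in a tiny shell function. A minimal sketch (the psg name and the bash-specific substring expansion are my own, not anything standard):

# hypothetical wrapper: bracket the first character of the pattern,
# so the grep never matches its own process table entry (bash-only)
psg() {
  ps -ef | grep "[${1:0:1}]${1:1}"
}

→ psg sshd     # runs: ps -ef | grep [s]shd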
Happy Hacking!
There's been a lot of talk over the past few years about the relevance of Cloud Foundry and BOSH in the face of that titan of container runtimes, Kubernetes. I'd like to throw my predictions out there, and hopefully articulate my vision of whither BOSH, whither Kubernetes.
BOSH is, at its core, a VM orchestration engine. It operates on an impressive selection of clouds / IaaSes, including the big ones: AWS, Azure, GCP, vSphere. It creates VMs, provides them with software and configuration, and supervises them. If any VMs crash or go missing, BOSH recreates them.
Cloud Foundry currently sits on top of BOSH (it's packaged exclusively for BOSH, and canonically runs as a bunch of VMs). It provides what we call the cf push story.
It goes like this:
$ cd ~/code/killer-app
$ cf push
Yup. That's it. The power of Cloud Foundry is that deploying applications really can be that simple. Even with persistent data services, custom domains, path-based routing, and raw TCP requirements: cf push.
Under the hood, Cloud Foundry has its own container runtime, called Diego, which supervises and scales application instances (really: containers). If any application instances crash or go missing, Diego recreates them.
Kubernetes is the darling of the container world, and provides a large and flexible toolkit for making containers and OCI-compliant images, like those created by docker build ..., into a usable system built on pods (really: containers). If any pods crash or go missing, Kubernetes recreates them.
So we've got three supervisors, each tasked with keeping shit running. Two of them leverage containers, and one deals almost exclusively in VMs.
Do I use BOSH? Do I create a BOSH release for the software and deploy it via my favorite director on my chosen IaaS?
Do I use Cloud Foundry + Diego, and package my software up in a form that can be containerized for me (via a buildpack)? Do I just push a Docker container into CF?
Do I use Kubernetes? Do I package up my software in Docker images, and write a Helm chart to handle the nitty-gritty of rollout?
There doesn't seem to be a lot of consensus on where to go from here. Here's what I've seen:
Cloud Foundry holds to the 12-factor path, one tenet of which is thou shalt not keep state. After all, you can't scale a node that has to replicate its local state across an ever-growing cluster. In CF land, your persistent data goes in your services.
Under this approach, Kubernetes is where you deploy your services, since Kubernetes can deal with state. Everything is still a container, so operators get to multiplex lots of tenants onto a shared set of machines. The advanced resource- and CPU-share capabilities of Kubernetes help to mitigate noise-neighbor problems (which often drove us to wanting on-demand, dedicated VMs for our services).
There are two downsides to this:
As near as I can tell, this is the entire strategy of CFCR, and its commercial big brother, PKS. Use BOSH to deploy a Kubernetes cluster. This gets you all of the value of the BOSH lifecycle - VM replacement, semi-immutable VM deployments, and IaaS-agnosticism.
The main problem I see here is that Kubernetes is still a walled garden, separate from Cloud Foundry's Diego.
Diego is a container runtime. Kubernetes is a container runtime. As they say in old Westerns, this town ain't big enough for the both of us. As an operator, I'd really rather have one container runtime, that everyone gets to use.
That's what project Eirini from SuSE/IBM is setting out to do. Since runtime / container orchestrators are commodity, why not unhitch Diego from Cloud Foundry's API and allow system designers to swap in Kubernetes? (or Swarm? or Nomad?)
This neatly solves the waste problem of having more than one runtime. Cloud Foundry itself (the routing layer, UAA, database, API, etc.) is all still deployed via BOSH, and you can do what you want with services. I'd wager, however, that more operators and service integrators will move to Kubernetes for the sheer audience size, and abandon BOSH for non-CF things.
This one is a bit odd, but there is a project out there on GitHub that provides a BOSH Cloud Provider Interface for Kubernetes. BOSH uses the CPI abstraction to deal with an idealized IaaS, and uses CPI plugins to adapt to different real-world clouds. There is an AWS CPI, for example, and one for OpenStack.
The Kubernetes CPI for BOSH translates VM operations into Pod operations.
This would work, and probably work well, were it not for the fact that BOSH is predisposed to thinking in terms of virtual machines. Since VMs take a while to provision, BOSH does in-place upgrades. In a purely container solution, this is undesirable, as it breaks the immutable containers promise.
I'm an architect by trade, so I live in the future where everything works and nothing is broken. Here's where I see us going:
Let's let BOSH do what BOSH does best: provision and maintain virtual machines. It's really good at it, and it has lots of experience. I have other tools at my disposal that build on top of the BOSH paradigm, like Genesis. Those continue to work really well for the Kubernetes clusters themselves.
All the other bits of the infrastructure that I'm used to get deployed on top of Kubernetes.
SHIELD becomes a Helm chart.
My monitoring system is a set of containers, either in situ with the monitored workloads, or off-cluster (in another k8s).
Blacksmith is both deployed on Kubernetes, and also deploys service instances to it.
Concourse no longer has workers, pipeline jobs are one-off tasks inside the k8s cluster.
I don't deploy any Diego cell VMs via BOSH. I don't deploy Diego cell containers and nest my containerization. Cloud Foundry (via Eirini or something like it) is scheduling application instance containers directly onto the Kubernetes substrate.
BOSH and Kubernetes play important and complementary roles in the modern cloud data center. Let's let BOSH deal with the VMs and let Kubernetes handle the containers.
In the end, I think we'll all be better off.
Here is a cheat sheet for the tmux.conf provided by jhunt/env on GitHub.
I use Ctrl-a instead of Ctrl-b. I also map Caps Lock to be Ctrl instead, so Ctrl-a is two neighboring keypresses, which is a lot easier on my fingers. I type professionally, and a lot of that typing is done from inside tmux, so this small change has substantial impact.
This is what my tmux looks like.
Ctrl-a + d | Detach from the current tmux session, leaving it running in the background. You can log off of a server after this. |
Ctrl-a + Ctrl-n | Focus the next window in this session. I usually just hold down the Ctrl key for this. |
Ctrl-a + Ctrl-p | Focus the previous window. I also just hold down Ctrl while moving from a to p. |
Ctrl-a + x | Kill (terminate, with prejudice) the current window. Useful if a program won't respond to signals. Remember to let up on Ctrl before hitting x. |
Ctrl-a + " | Open the window list, allowing you to select windows (with preview!) |
Ctrl-a + - | Horizontally split the current pane, focusing on the bottom pane. The hyphen sort of looks like it could divide the window across the middle... |
Ctrl-a + \ | Vertically split the current pane, putting the new pane on the right. \ is just a | without Shift held down, and | looks like it divides the window. A bit of a stretch, I know, but it's a serviceable mnemonic. |
Ctrl-a + ← Ctrl-a + h | Move left to the pane next to this one. |
Ctrl-a + ↓ Ctrl-a + j | Move down to the pane below. |
Ctrl-a + ↑ Ctrl-a + k | Move up to the pane above. |
Ctrl-a + → Ctrl-a + l | Move right to the pane next to this one. |
Ctrl-a + [ | Enter copy mode. |
Ctrl-u | Scroll up by a page. |
Ctrl-d | Scroll down by a page. |
← ↑ ↓ → | Move around in the buffer. |
Space | Start a copy highlight at the cursor. Exits copy mode. |
Enter | (after a Space and some movement commands) copy the highlighted text to the paste buffer. |
Ctrl-a + ] | Paste the contents of the paste buffer to the terminal. Often useful from within |
Ctrl-c | Exit copy mode. Note the lack of Ctrl-a prefixing this particular key-sequence. |
Ctrl-a + Shift-s | SSH somewhere, using the name or IP typed in at the |
Ctrl-a + : | Enter tmux command mode, which opens up a prompt on the status bar, where you can enter any tmux command. See tmux(1) for what you can do from here. |
I've always been the kind of person to do-it-myself. I designed my own computer games in grade school, built my own servers in high school, and I'm even in the process of writing my own programming language.
So when it came time to start doing some serious day-to-day work with Docker, on my Macbook, I started looking into the nuts and bolts of running a Linux container platform on the decidedly non-Linux Darwin kernel.
The simple fact is, you can't. You cannot run Linux containers on Darwin. That's not how this works. You can only run Linux containers on Linux kernels.
The easiest way to get up and running is to use something like Docker for Mac. It takes care of everything for you, spinning up a custom Linux virtual machine using macOS's lightweight Hyperkit / xhyve virtualization framework. Docker for Mac handles all of the wiring for you, so that when you execute docker images in your macOS terminal, your client contacts the docker daemon executing inside the Linux VM.
Neat.
Waving aside some of the many reported performance issues with Docker for Mac + Hyperkit, there's the simple fact that I already run a Linux Vagrant instance which I use for other things. I'd really rather not incur the cost of running two of them.
As it turns out, the wiring to get a macOS user-space docker client hooked up to a Linux kernel-space docker daemon is pretty straightforward.
My Vagrant box is Ubuntu-based (Xenial, 16.04 LTS), and Docker is configured out of the box to bind only the UNIX domain socket, for zero-conf communication with local docker clients. Luckily, we can change that.
Xenial also means systemd, and these steps should apply nicely to Bionic (18.04 LTS). First, we need to edit the docker systemd .service file:
$ sudo vim /lib/systemd/system/docker.service
The line we're interested in is ExecStart=...; my service file looks like this:
[Unit]
Description=Docker Application Container Engine
Documentation=https://docs.docker.com
After=network.target docker.socket
Requires=docker.socket
[Service]
Type=notify
ExecStart=/usr/bin/docker daemon -H fd:// -H tcp://0.0.0.0:9098
MountFlags=slave
LimitNOFILE=1048576
LimitNPROC=1048576
LimitCORE=infinity
[Install]
WantedBy=multi-user.target
The additional -H flag will cause the docker daemon to bind another interface. Here, I've chosen TCP port 9098, on all interfaces.
Once the .service file is all fixed up, we need to inform systemd, via a daemon-reload (to pick up on the fact that the file changed) and a restart (to actually make the changes real):
$ sudo systemctl daemon-reload
$ sudo systemctl restart docker
To verify, you can either grep the process table:
$ ps -ef | grep docker
root 1935 1 0 13:05 ? 00:00:02 /usr/bin/docker daemon -H fd:// -H tcp://0.0.0.0:9098
... or check netstat:
$ sudo netstat -tlnp
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 0.0.0.0:22 0.0.0.0:* LISTEN 1937/sshd
tcp6 0 0 :::9098 :::* LISTEN 1935/docker
tcp6 0 0 :::22 :::* LISTEN 1937/sshd
Good to go!
Next up, we need to be able to access the TCP port that docker is listening on inside the Linux VM, from outside the VM.
If you're running Vagrant like I am, you can just add this to your Vagrantfile:
Vagrant.configure('2') do |config|
  config.vm.box = 'jhunt/vagabond'
  config.vm.box_version = '1.0.2'
  config.vm.synced_folder ".", "/vagrant"

  for i in 9000...9099
    config.vm.network :forwarded_port, guest: i, host: i
  end
end
For these changes to take effect, you will have to restart your Vagrant instance. I find that port-forwarding a whole block (the 90xx block) makes my life easier later on down the road.
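Once the client wiring in the next section is done, that pays off: any container port you publish in the 90xx range is reachable straight from the Mac. For example (the port choice is arbitrary):

$ docker run -d -p 9042:80 nginx
$ curl -s http://127.0.0.1:9042/ | head -n4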
.${SHELL}rc
The last piece of the puzzle is to direct the docker client on the Mac to use the forwarded port, instead of a local UNIX socket. The DOCKER_HOST environment variable governs that, so I added this to my ~/.bashrc:
export DOCKER_HOST=tcp://127.0.0.1:9098
Now (after sourcing ~/.bashrc, or opening a new terminal), I can use docker on my mac!
$ uname -s
Darwin
$ docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
nginx latest e2c463314119 2 weeks ago 0B
ubuntu latest c6f8c325f4ca 8 weeks ago 0B
alpine latest 189d5ae0f1aa 4 months ago 0B
I've been using this setup for a week or so now, and it works really well. Most of the time, I forget I'm even bouncing through the Linux VM, it's such a seamless experience.
One thing I have noticed is that docker images (as shown above) does not report the image sizes appropriately. I'm not sure why this is, but other than being an oddity, it hasn't caused much grief.
Happy Hacking!
So the other day, I started writing braid. It's an implementation of the HTTP RFCs (723x). You see, I want to write web APIs in C.
I know, I know, I should use a more modern language. Or a memory-safe language. Or some other such nonsense. I have been in this industry long enough, I shall do as I damn well please. Besides, GitHub is big enough for all of us.
I started in on the guts of RFC 7230, the message syntax, and banged out a parser with a sliding window buffer. It worked pretty well, and was quite flexible, if I do say so myself. The design hinges upon the idea of strands, the small bits of a string that we can read with only a small, fixed-size buffer to range over.
Consider a minimalist HTTP request:
GET /some/path HTTP/1.1⏎
Host: localhost⏎
Accept: */*⏎
⏎
That's less than a hundred bytes. It's super easy to parse, too: just read the whole thing into a 1k buffer and go to town.
But what happens if some jerk out on the internet sends you a request like this:
GET /some/path HTTP/1.1⏎
X-Evil: «10,000 "HA"'s in a row»
Host: localhost⏎
Accept: */*⏎
⏎
That's definitely not going to fit in our 1024-byte buffer. Or our 8192-byte buffer. Or a 64k buffer. At some point, if we keep making the buffer bigger, the world cranks out a bigger prankster trying to crash our software, looking for a laugh at best, or an RCE at worst.
What the braid parser does is keep track of a single strand, which contains the chunk of a string we've seen so far. When it runs out of buffer, but hasn't finished parsing a request, the parser measures off enough memory, copies out the parsed data, and appends it to the current strand.
I wrote a small test-suite that exercised various edge cases for this stranding design, varying the length of the URI from 100 characters to 19,000 characters, and then doing the same for a header value, a header name, etc.
All Tests Passed.
A beautiful feeling. But this is an essay about troubleshooting, by gum, and for that, you need trouble. My trouble started when I noticed the off-by-one error that led me to stop at 19,000 characters. One s/</<=/ later, and now my tests are failing.
They're not just failing. They're segfaulting.
Now I've been doing all of my development on my new MacBook 13", 2017 edition (without the Touch Bar thank you very much). Say what you will about developing software on macOS, but I'll say this: the debugging facilities outright suck. No Valgrind. No American Fuzzy Lop. GDB requires some crazy workarounds.
So I take a quick detour to a random Linux cloud host that I keep around because sometimes it beats a sharp stick in the eye. There, arrayed before me (heh) are all my beautiful debugging tools; implements of master craftsmanship.
So I fire up Valgrind.
valgrind ./msgtest </test-data/braid.test.cF9n4TI
... a bunch of boring stuff omitted; you're welcome ...
==24161== Invalid read of size 8
==24161== at 0x400B41: append (msg.c:129)
==24161== by 0x40163A: advance (msg.c:378)
==24161== by 0x401A62: main (msg.c:462)
==24161== Address 0x52043d0 is 4 bytes after a block of size 12 alloc'd
==24161== at 0x4C2DB8F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==24161== by 0x400C60: append (msg.c:144)
==24161== by 0x401364: advance (msg.c:325)
==24161== by 0x401A62: main (msg.c:462)
==24161==
Okay, so that's bad. Apparently, my little test program is trying to tromp all over someone else's memory. This is an out-of-bounds memory access violation, which is just the kind of thing you expect Valgrind to find. Yay!
So now I just need to figure out who is setting the pointer to the errant address. I tracked the problem down to the next pointer on the strand implementation, and it only seems to manifest when we have to append the second strand.
As I am a proud printf debuggerer, I found all the places in the code where the next pointer is set, nullified, referenced, or even thought about, and I put in copious amounts of printf("next: %p\n", x->next) calls. Fire up Valgrind again. Get the same error about an invalid read, 4 bytes into another variable's back yard. Same address too! But...
But. But none of the printf statements ever printed the value being accessed. That 0x52043d0 pointer address is literally only printed out by Valgrind itself. And Valgrind has no bugs, so this is one hell of a thinker.
Anyhow, it's time to go to my day job, so I switch off of that terminal tab and focus my mind on lesser demons like Go, and avoiding Slack, and tracking down people for pairing sessions.
The back of my head, however, can't stop working on this stupid problem. 0x52043d0 is slowly, maddeningly burning a hole into my brain. Teasing me. Taunting me. Tormenting me.
... fast-forward 8-or-so hours ...
Back to this.
I need to look at memory. Clearly something went wrong somewhere, and some pointer got munged. When you need to look at arbitrary memory locations, you need GDB.
gdb ./msgtest
gdb> run <t/test-data/braid.test.cF9n4TI
... a bunch more boring stuff omitted; you're welcome ...
Program received signal SIGSEGV, Segmentation fault.
0x0000000000400b41 in append (st=0x604000, a=0x7fffffffdd10 'A' <repeats
200 times>..., b=0x7fffffffe110 "") at src/msg.c:129
129 fprintf(stderr, "ins->next->next=%p\n", ins->next->next);
Yup. That's a segfault. Looks like it's actually dying on one of my printf()s. Time to look at this ins character (that's the insertion point on the strand, if you're curious).
(gdb) print ins
$1 = (struct strand *) 0x604000
Right. A pointer. Gotta dereference it.
(gdb) print *ins
$2 = {len = 0, data = 0x31 <error: Cannot access memory at address
0x31>, next = 0xb}
Now that IS interesting. The next pointer is 0xb; not quite NULL (that would be 0x0), but definitely not high enough to be an actual pointer. Also, we have a data pointer of 0x31, which is also suspiciously low in the address space, and a len of 0 (no bytes in the strand).
All of this points to some sort of memory corruption. But where? Valgrind didn't indicate that anything else was off. These errors are all occurring when we append() to the parser's strand.
On a whim, I set a breakpoint in advance(), start the process over again, and when the breakpoint fires, I set a hardware watch on p->strand.
(gdb) break advance
Breakpoint 1 at 0x400db0: advance. (2 locations)
(gdb) run
The program being debugged has been started already.
Start it from the beginning? (y or n) y
Starting program: /home/jhunt/code/braid/msgtest
<t/test-data/braid.test.cF9n4TI
Breakpoint 1, advance (p=0x7fffffffdd00) at src/msg.c:206
206 if (!p->req)
(gdb) watch p->strand
Hardware watchpoint 2: p->strand
The lovely thing about hardware watch points is that GDB traps all changes to that memory address, and will dump us back into the debugger whenever they occur. For convenience, I cleared my breakpoint (don't need it anymore), and continue on.
(gdb) clear advance
Deleted breakpoint 1
(gdb) continue
Continuing.
... more program output, mostly my annoying printf() debugging ...
Hardware watchpoint 2: p->strand
Old value = (struct strand *) 0x0
New value = (struct strand *) 0x604060
advance (p=0x7fffffffdd00) at src/msg.c:250
250 if (c == end)
Excellent. p->strand changed from 0x0 (no strand) to 0x604060, some bit of heap-allocated memory. Nothing unusual here; continuing on.
(gdb) continue
Continuing.
... more program output, mostly my annoying printf() debugging ...
Hardware watchpoint 2: p->strand
Old value = (struct strand *) 0x604060
New value = (struct strand *) 0x0
advance (p=0x7fffffffdd00) at src/msg.c:254
254 c += 1;
This is just the parser weaving the strands back together into a string and disposing of the (now unneeded) strands. Perfectly normal. GDB and I continue in this vein for a while, back and forth, NULL → !NULL, !NULL → NULL until a very curious thing happens.
(gdb) continue
Continuing.
and we're out of bytes. sliding the buffer.
Hardware watchpoint 2: p->strand
Old value = (struct strand *) 0x604060
New value = (struct strand *) 0x604000
advance (p=0x7fffffffdd00) at src/msg.c:224
224 end = p->buf + p->used;
Weird. The "and we're out of bytes..." bit is the program indicating that it hit the end of the fixed-size read buffer, and is about to append to the strand. Normally, appending to the strand leaves p->strand alone (append happens at the end, modifying the terminus ->next pointer). But here, we've ... we've assigned an interesting value to the strand pointer.
It's exactly 96 bytes before the value it used to have.
That's weird for a variety of reasons. For starters, heap allocation usually works from lower to higher memory addresses as it chugs through the arena or the slab (depending on the allocator). Also, 96 bytes doesn't seem to match the sizeof() any of my structures (I checked).
On a whim (and perhaps to keep my fingers busy while my brain idled), I ran:
(gdb) list 224
219 p->used += nread;
220 if (p->used == 0)
221 return -1;
222 p->buf[p->used] = '\0';
223
224 end = p->buf + p->used;
225 c = p->buf + p->dot;
226
227 again:
228 switch (p->state) {
And stared at the output for a good five minutes. Realization dawns on me slowly, as I then decide to look at the parser structure itself, *p:
(gdb) print *p
$3 = {flags = 0, dot = 0, used = 1024, fd = 0,
buf = 'A' <repeats 1024 times>, strand = 0x604000,
state = 6, req = 0x604010, header = 0x604120}
Everything clicks into place. The devious machinations and lewd gyrations this execution path has made come to the forefront of my mind. I dare say I heard a choir of angels sing out in a Hallelujah! of buffer overruniness.
See, p->buf is the fixed-size buffer window. It's right next to p->strand in the memory layout, and because it's a nice round number (1024, 4096, 8192, all of them nice round numbers), there's no structure padding to maintain alignment.
The memory looks something like this:
|.----- buf -----.|.--------- strand --------.|
+---+---+-----+---+------+------+------+------+
| A | A | ... | A | 0x60 | 0x40 | 0x60 | 0x00 |
+---+---+-----+---+------+------+------+------+
Since this is an x86-64 chip, everything is little-endian, so 0x604060 (the previous value of p->strand) is stored least-significant byte first, or 60 40 60 00.
Refer back to the last code listing we did. Notice that line 222 explicitly sets the p->used'th cell of p->buf to 0, which is a fancy string-like way of writing 0x00.
So we go from:
|.----- buf -----.|.--------- strand --------.|
+---+---+-----+---+------+------+------+------+
| A | A | ... | A | 0x60 | 0x40 | 0x60 | 0x00 |
+---+---+-----+---+------+------+------+------+
to:
|.----- buf -----.|.--------- strand --------.|
+---+---+-----+---+------+------+------+------+
| A | A | ... | A | 0x00 | 0x40 | 0x60 | 0x00 |
+---+---+-----+---+------+------+------+------+
overwriting the least significant byte of our strand pointer. The next time we try to deal with the strand, we're going to fall short of 0x604060, and land in 0x604000, thanks to that misplaced NULL terminator.
Once I figured that out, I did a little bit of digging, scanning back up through the code listing:
(gdb) list 217
212 nread = read(p->fd,
213 p->buf + p->used,
214 PBUFSIZ - p->used);
215
216 if (nread < 0)
217 return -1;
218
219 p->used += nread;
220 if (p->used == 0)
221 return -1;
Sure enough, there's the culprit. When we read() into the buffer, we fill it completely, using up every ounce of space. Then we advance p->used forward, exactly the number of bytes we read. If there's more to be read than buffer space, p->used will EQUAL PBUFSIZ, which means that line 222 tries to assign the terminator beyond the end of the buffer.
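The actual fix didn't make it into my scrollback, so treat this as a sketch; it amounts to reserving one byte for the terminator, reusing the names from the listing above:

/* read at most PBUFSIZ - p->used - 1 bytes, so that the
   p->buf[p->used] = '\0' assignment on line 222 always stays in-bounds */
nread = read(p->fd,
             p->buf + p->used,
             PBUFSIZ - p->used - 1);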
(Admittedly this is why people discourage implementing much of anything in C. Personally, I live for this stuff.)
Mystery solved. A well-meaning bit of code to properly NULL-terminate a string buffer (probably for debugging or something) ran afoul of assumptions of previous code, and ended up munging a nearby pointer that was minding its own business.
Were it not for those hardware watch points in GDB, and Valgrind pointing to possible memory corruption, this bug could have entailed an excruciating line-by-line review of the codebase.
Happy Hacking!
This morning I set out to get SLIMV working, for a more integrated (Common) Lisp editing experience, with a tighter edit-compile-debug loop.
Now I'll admit, I've done bad and terrible things to every vim installation I've ever spent more than a few days with. Installing files into $VIMRUNTIME, hamfistedly hacking up other people's plugins and scripts without taking the time to figure out vimscript, etc. For these sins I feel bad (but not bad enough to actually learn vimscript).
So when I went to install the SLIMV bundle, I did what I always do: muddle through bits and pieces of installation docs until I can make the damn thing work.
This time, however, I wrote down some of the ways I debug this terrible mess I've gotten myself into, and I wanted to share it with you, dear reader.
Good programmers use debuggers. Great programmers (and most sysadmins) use print debugging. Vim does lots of magical stuff at startup, and most of the time when that magic breaks (or "breaks") you get no output.
Enter echomsg (also called echom by chronic and unapologetic abbreviators). I use this one all the time:
" Only do this when not done yet for this buffer
echom "jhunt: i'm trying to load this plugin, here goes..."
if exists("b:did_ftplugin")
  echom "jhunt: looks like someone already beat us to the punch."
  finish
endif
echom "jhunt: looks like this plugin is OFF TO THE RACES!"
You can't really leave these lying around, since they do print out before the 'visual' mode of vim starts up, forcing you to constantly press Enter every time. But you can comment them out with a " when you're done with them (or just remove them, if you're feeling overconfident).
If the script is even being loaded (definitely not a guarantee), you'll see this:
→ vim x.lisp
jhunt: i'm trying to load this plugin, here goes...
jhunt: looks like this plugin is OFF TO THE RACES!
Press ENTER or type command to continue
g: & b: Variables
Most vimscripts deal with variables. SLIMV, for example, attempts to auto-detect what Common Lisp implementation is installed on the local machine. To avoid doing this for every new buffer, it sets a global variable (that's what that g: prefix means), called g:slimv_lisp_loaded. The auto-detection logic checks for the existence of this variable. And now you can too!
:echo g:slimv_lisp_loaded
1
Do that in the vim command window (normal mode), and you'll either get a value (as above) or not. Initially, the script wasn't running properly -- more on that in a bit. I figured that out by trying to echo the "did I do it yet" variable, and getting a big ugly undefed error from vim.
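For the record, that error looks something like this (reproduced from memory, so treat the exact wording as illustrative):

:echo g:slimv_lisp_loaded
E121: Undefined variable: g:slimv_lisp_loaded
E15: Invalid expression: g:slimv_lisp_loaded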
Virtually every professional filetype plugin (i.e. not mine) honors the b:did_ftplugin variable. That b: prefix means "buffer-local" -- the variable has a different life for each buffer vim has open. You can also echo those, but be aware of which buffer you're in when you do so.
:echo b:did_ftplugin
1
vim lacks namespaces for variables (aside from b: and g:) so plugin authors conventionally prefix all of their variable names manually. This is super helpful for debugging, since vim will tab-complete variable names when you are trying to :echo them.
:echo g:
g:copy_as_rtf_preserve_indent g:loaded_spellfile_plugin
g:copy_as_rtf_using_local_buffer g:loaded_tarPlugin
g:did_indent_on g:loaded_vimballPlugin
g:did_load_filetypes g:loaded_zipPlugin
g:did_load_ftplugin g:mapleader
g:ft_ignore_pat g:markdown_fenced_languages
g:lisp_rainbow g:matchparen_insert_timeout
g:loaded_2html_plugin g:matchparen_timeout
g:loaded_copy_as_rtf g:n7_rainbow
g:loaded_getscriptPlugin g:ruby_minlines
g:loaded_gzip g:syntax_on
g:loaded_logiPat g:vimsyn_embed
g:loaded_matchparen g:zipPlugin_ext
g:loaded_netrwPlugin
:echo g:n7
g:n7_rainbow
Since I ran this from the text of this blog post, the SLIMV stuff isn't present, since I don't write this blog in Lisp (yet).
Generally speaking, I find most of my vim plugin flubs come from bad pathing and incorrect installation. If vim can't see the code, it can't execute the code. Which is why I was super stoked to find :scriptnames.
Go ahead, fire up vim and run :scriptnames in normal mode.
...
Isn't that awesome!? It prints out the absolute path to every vimscript and plugin that vim loaded during startup; core stuff, user stuff, it's all there!
I hope this helps you next time you need to debug your really borked up vim installation, future James. If anyone else finds this useful, all the better!
Happy Hacking!
I've been obsessed with Lisp for a long time. For most of that time, I've had dreams and aspirations of building a new Lisp dialect that compiled down to machine code as a standalone static executable. I reasoned that if only I could ship static binaries, no one would care what language I wrote the software in, and I could finally use Lisp!
Last week, that all changed, when it finally dawned on me that nobody cares what language I write my software in, because most of the time it's getting stuffed into:
Suddenly, I was without reasons blocking me from using Lisp! I packaged up SBCL into a BOSH Release, and then I went off to build a Common Lisp buildpack for Cloud Foundry! It all worked! Yay!
Now, I needed to get down to the business of actually writing software (the scary part)! So now I had to go find solutions for all the things I was used to getting in other languages, like JSON libraries, web frameworks, etc.
This is the story of building a small web API server in Common Lisp. We're going to play with some cool stuff, and it won't be as painful as you might think.
Here goes.
Libraries are a programmer's best friend. Networks for finding and installing libraries on-demand are even more dear.
For this little foray, we'll be using Quicklisp. Go read the docs and get it installed if you're going to play along at home. We're going to be leveraging three libraries, Hunchentoot, Drakma, and cl-json.
;; put this at the top of your Common Lisp source file
(load "~/quicklisp/setup.lisp")
(ql:quickload :hunchentoot)
(ql:quickload :drakma)
(ql:quickload :cl-json)
I don't have time to come up with a toy application, so we're just going to build a real one.
Shout! is a notifications gateway that I've wanted to build for a long time. Eventually, it will grow into a flexible means of applying business logic to break/fix style notifications, determining how, when, and where to dispatch to real messaging systems like email, SMS, Slack, etc.
For now, it's just going to send all the messages to Slack, 24/7. This is a blog post, not a Ken Burns documentary.
Shout! operates on a stream of events, and maintains a set of states that result from said events. If you've ever dealt with a failing Concourse pipeline, you've probably seen a string of failure messages (pipeline's broke. pipeline's broke. pipeline's broke.) followed by silence (by which you know the pipeline is now fixed). With Shout!, the first failure elicits an "it's broke" notification, and the next success sends the all-clear.
Let's start with the types of objects we'll be dealing with.
An event is a single input from the outside world:
(defclass event ()
  ((message
    :initarg :message
    :accessor message)
   (ok
    :initarg :ok
    :accessor event-ok?)))
A state tracks events occurring to a "topic":
(defclass state ()
  ((name
    :initarg :name
    :accessor state-name)
   (status
    :initarg :status
    :initform "unknown"
    :accessor status)
   (last-event
    :initarg :last-event
    :accessor last-event)))
Then we can define an association list of states, indexed by topic:
(defvar *states* ())
We're going to need a way of applying a new event to a state, and thereby modify the status, to go from "working" to "broken", for example:
(defun still-ok? (e1 e2)
  (and (event-ok? e1)
       (event-ok? e2)))

(defun transition (e1 e2)
  (cond ((still-ok? e1 e2) "working")
        ((event-ok? e2) "fixed")
        (t "broken")))
still-ok? checks two events, which should have occurred in sequence, and returns whether or not the state they belong to is still ok. There's no point in notifying people that things are still good all the time.

The transition function takes two events (also in sequence) and returns the word that describes the transition. We'll use this in our Slack notification to say stuff like the ice cream parlor is still broken and the political divide is now fixed.
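A quick REPL sanity check makes the behavior concrete (events built inline; :message is omitted since transition never reads it):

(transition (make-instance 'event :ok t)
            (make-instance 'event :ok t))    ; => "working"
(transition (make-instance 'event :ok nil)
            (make-instance 'event :ok t))    ; => "fixed"
(transition (make-instance 'event :ok t)
            (make-instance 'event :ok nil))  ; => "broken"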
(defun update-state (state event)
  (let ((prev (last-event state)))
    (setf (status state) (transition prev event)
          (last-event state) event)
    (when (not (still-ok? prev event))
      (notify-about state))
    state))

(defun create-state (key topic event)
  (let ((state (make-instance 'state
                              :name topic
                              :last-event event
                              :status (if (event-ok? event)
                                          "working" "broken"))))
    (setf *states* (acons key state *states*))
    state))

(defun ingest (topic event)
  (let* ((key (intern topic))
         (state (cdr (assoc key *states*))))
    (if state
        (update-state state event)
        (create-state key topic event))))
Finally, ingest takes a topic (string), and an incoming event, and does the needful with states; if we already have a state for the given topic, we determine if there was a transition, handle the bookkeeping, and even send out a notification. Otherwise, a new state gets created and put into the association list (via acons).
We'll skip notify-about for right now.
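To make that concrete, here's roughly how two successive calls play out in the REPL (return values elided; this assumes notify-about exists, which we get to below):

(ingest "ice-cream-parlor"
        (make-instance 'event :ok nil :message "out of sprinkles"))
;; creates a new state, status "broken"

(ingest "ice-cream-parlor"
        (make-instance 'event :ok t :message "sprinkles restocked"))
;; transitions the state to "fixed", and fires a notification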
Before we can get to the webby stuff, we need JSON representations for our event and state objects. One thing I've found is that whenever I need a library in Common Lisp, it's out there. A good place to start is Quicklisp. Sure enough, they have a cl-json library!
Let's build some helper functions to transform our objects into structures suitable for JSON-ification:
(defun event-json (event)
  (when event
    `((message . ,(message event))
      (ok . ,(event-ok? event)))))

(defun state-json (state)
  (when state
    `((name . ,(state-name state))
      (status . ,(status state))
      (last . ,(event-json (last-event state))))))
We can pass the output of these two functions directly to the JSON library to get back JSONified strings.
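For instance, a quick REPL sketch (output shape per cl-json's alist encoding):

(encode-json-to-string
  (event-json (make-instance 'event :ok t :message "all clear")))
;; => "{\"message\":\"all clear\",\"ok\":true}"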
Now we need to go get ourselves a web server. Luckily, Common Lisp has a pretty nice one, the eminently google-able Hunchentoot. It works like most other web dispatch frameworks — set up some handlers and let the library do the heavy lifting.
We only really need one endpoint, POST /events. We're going to build two. GET /states will return the full states database, in JSON form.
(defun api (port)
  (defun handle-get-states ()
    (setf (content-type* *reply*) "application/json")
    (format nil (encode-json-to-string *states*)))
  (push (create-prefix-dispatcher "/states" 'handle-get-states)
        *dispatch-table*)

  (start (make-instance 'easy-acceptor :port port)))
handle-get-states is our worker function; it sets the Content-Type HTTP header to indicate that we're sending down JSON, and then JSONifies the states database.
Next up, POST /events takes a JSON object with message and ok keys, tracks the event, and reacts. Mostly this is just a call to ingest, with some serialization functions.
(defun json-body ()
  (decode-json-from-string
    (raw-post-data :force-text t)))

(defun attr (object field)
  (cdr (assoc field object)))

(defun api (port)
  (defun handle-get-states ()
    (setf (content-type* *reply*) "application/json")
    (format nil (encode-json-to-string *states*)))
  (push (create-prefix-dispatcher "/states" 'handle-get-states)
        *dispatch-table*)

  (defun handle-post-events ()
    (setf (content-type* *reply*) "application/json")
    (let ((b (json-body)))
      (format nil "~A~%"
              (encode-json-to-string
                (state-json
                  (ingest
                    (attr b :topic)
                    (make-instance 'event :ok (attr b :ok)
                                          :message (attr b :message))))))))
  (push (create-prefix-dispatcher "/events" 'handle-post-events)
        *dispatch-table*)

  (start (make-instance 'easy-acceptor :port port)))
I introduced some helper utilities. json-body decodes the raw HTTP request body (for our POST endpoint). attr is just shorthand for retrieving the value of a key from an association list. This is what Lisp is all about to me: writing small, composable utility functions that make the rest of the codebase clearer.

Now we have everything we need in our api function. We could stop here, but I couldn't shake a bad feeling while I was writing that last version. It's a lot of repetition, and it's a bit awkward. For each endpoint, we define a function, then we hook it into the dispatcher.
I'd rather do this:
(handle "/path/to/register"
  ;; body of the handler
  ;; do stuff, and return an object
  *states*)
You may not be able to do this in other languages easily, but this is Lisp!
Let's write a macro.
(defmacro handle (url &body body)
  (let ((fn (gensym)))
    `(progn
       (defun ,fn ()
         (setf (content-type* *reply*) "application/json")
         (format nil "~A~%" (encode-json-to-string (progn ,@body))))
       (push (create-prefix-dispatcher ,url ',fn) *dispatch-table*))))

(defun api (port)
  (handle "/states" *states*)
  (handle "/events"
    (let ((b (json-body)))
      (state-json
        (ingest
          (attr b :topic)
          (make-instance 'event :ok (attr b :ok)
                                :message (attr b :message))))))
  (start (make-instance 'easy-acceptor :port port)))
With this macro, we've simultaneously cut down on the number of lines of code and increased the clarity of the api function. And since it's a macro, there's no runtime penalty! Win!
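If you're skeptical, ask the REPL what the macro actually produces (the gensym name will differ on your machine, and I've elided the function body):

(macroexpand-1 '(handle "/states" *states*))
;; => (PROGN
;;      (DEFUN #:G42 () ...)
;;      (PUSH (CREATE-PREFIX-DISPATCHER "/states" '#:G42)
;;            *DISPATCH-TABLE*))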
The last piece of this puzzle is integrating with Slack. It's high time we implemented notify-about. For that, we'll need an HTTP client library, and that's what Drakma is for.
If you recall, I plan to eventually do more than just Slack, so we're going to implement a standalone function for sending messages via Slack.
(defun slack (ok summary details)
  (http-request
    *slack-webhook*
    :method :post
    :content (encode-json-to-string
               `((text . ,summary)
                 (username . "shout!bot")
                 (icon_url . "https://bit.ly/2AC9vAV")
                 (attachments
                   ((text . ,details)
                    (color . ,(if ok "good" "danger"))))))))
Then notify-about just becomes a thin wrapper:
(defun notify-about (state)
  (let ((event (last-event state)))
    (slack (event-ok? event)
           (format nil "~A is ~A"
                   (state-name state)
                   (status state))
           (message event))))
Here's the final code, which you can take for a spin by running this in the SBCL REPL:
(load "shout.lisp")
(api 8080)
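Then, from another terminal, throw some events at it (the JSON keys match the attr lookups above):

$ curl -X POST http://localhost:8080/events \
       -d '{"topic":"ice-cream-parlor","ok":false,"message":"freezer is down"}'
$ curl http://localhost:8080/states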
This (complete, working) implementation clocks in at 122 lines of code. That's pretty impressive, and speaks to the breadth of the language, the extent of the standard library, and the power of macros.
In a future post, I'll go into more detail about how I package up Common Lisp code like Shout! into BOSH, and how I built the Cloud Foundry Buildpack for Common Lisp. Stay tuned!
Empathy is more important than logic or "correctness".
Defined as the ability to understand and share the feelings of another, empathy derives from the Greek pathos, which survives into modern English and connotes the ability to evoke great feelings (usually of pity or sadness). When we empathize, we identify with the plight of the other, and begin to see things from their context.
Logic, on the other hand, is just a bunch of hard and fast rules that, while handy for programming computers, are mostly inapplicable to the world of humans.
A life (not lived in complete solitude) necessarily involves dealing with other humans. Unless you are content to accept that what may come may come, and that there is little one can do about their lot in life, life means bringing people around to your point of view. I call this projected empathy.
Let us suppose that I want you to employ some creation of mine to solve your problems, and in return, compensate me for the value I bring. This is called sales.
The Engineer in me (who is, internally, at least, quite persuasive) wants to appeal to your logical side. The solution is feature-ful. It solves your problem. The price point is within your budgets. The implementation is clean; the design, impeccable. You would be a fool not to see that. Buy my stuff!
Why aren't you opening your checkbook?
You are not a logical being. I am not a logical being. Despite what logic may dictate, humans are ruled by feeling and emotion. If you do not feel that mine is a solution that fits your problem, you're not going to buy.
"That's ludicrous!" screams the engineer! Clearly, this is the correct and true way forward. "Why can't you see that I'm right?!" he whines.
The Engineer wants empathy by fiat of logic. This projected empathy is not earned, it is forced (or at least attempted to be).
The Salesperson in me knows better than the engineer. The first thing he does is ask you about you. What do you do? How do you do it? What problems do you face? These questions establish a rapport. He isn't selling — at least not like that engineer. This rapport builds a bridge, from you to him. It binds you in trust. Humans, wrapped up in their feelings, love nothing more than talking about themselves, their interests, their accomplishments. It's why we tweet, blog, podcast, and speak at conferences.
More importantly, in coming to understand who you are and what motivates you, the salesperson develops empathy. Empathy is the best way to connect with someone and figure out how to bring them around to your point of view.
This sounds crass, I know, but let me explain the nuance here.
Empathy changes the empathizer. Much like a well-wrought logical discussion can move both sides towards a common agreement, empathy opens up, inside the empathizer, new feelings. Humans mimic. It's why babies spend so much time smiling at their parents; it's a game, one hard-wired deep into our bones.
When you empathize, you not only understand the other person, you also come to appreciate their views, and in doing so, modify your own. Empathy without change is not really empathy; it's a charade.
I wrote this article to understand how empathy works (empathizing with empathy), and to get my feelings and thoughts out and into prose (self-empathy). I hope there are lessons in here for others.
Thank you.
This is an article in the ongoing series on implementing Rook Lisp.
I can't live with a language that doesn't support functions of variable arity, and I refuse to create a language that forces support for variadic functions through list processing.
In Rook Lisp, I want to do this:
(printf "Hello, %s\n" name)
And I definitely don't want to do this:
(printf "Hello, %s\n" (list name))
So, Rook needs variadic functions, both in the core of the language, and for users to define. For the latter, we need a syntax. I figured I'd do what all language designers do best: go steal an idea from another language.
Thankfully, Rosetta Code makes this a lot easier than it used to be. After reading through all the examples of definition and usage of variadic functions in different languages, I've come to group them (the languages) into six categories.
There are, of course, languages in which you can't express variadic functions.
In languages that don't support variadic functions, but do support arrays of variable lengths, or even lists, you can fake n-ary functions by passing a list as the last parameter.
ALGOL-68 does that:
PROC printall = (FLEX[]STRING argv) VOID: (
  FOR i TO UPB argv DO
    print(argv[i])
  OD
);
printall(("Algol-68", "uses", "lists"))
In Perl, there are two things about functions and lists that make variadic support possible, without first-class support in the language: argument lists flatten, and every function receives its arguments in the list @_. Because of this, all functions in Perl are variadic; it's just that most Perl programmers don't abuse it.
sub printall {
  print "$_\n" for @_;
}
printall("Perl", "uses", "lists");
Some languages fake it by initializing unspecified positional parameters to a default value. AWK does this; any parameters the caller omits will be set to the empty string, "".
function printall(a,b,c,d,e,f){
  if (a != "") print a
  if (b != "") print b
  if (c != "") print c
  if (d != "") print d
  if (e != "") print e
  if (f != "") print f
}
printall("AWK", "handles", "missing", "arguments")
There are some severe downsides to this method. For starters, you can't define a truly variadic function; there is always an upper bound on the number of arguments a caller can supply. Additionally, callers cannot pass the default value explicitly.
JavaScript takes a novel approach: all functions can be given multiple arguments, but you can't lexically bind them to formal parameters. Instead, the function just accesses the special variable arguments, an array-like special object (it has a .length!)
function printall() {
  for (var i = 0; i < arguments.length; i++) {
    console.log(arguments[i]);
  }
}
printall("js", "uses", "a", "special", "variable");
As with Perl, since arguments is array-like, you can pass it around, slice it, dice it, and even use it with apply(). Not bad for the introduction of a reserved keyword!
Ruby has what's called a splat operator. If you prefix the name of the last formal parameter with the * sigil, Ruby accumulates all of the variable parameters, in each call, into that parameter, as a list.
def printall(*args)
  args.each do |arg|
    puts arg
  end
end
printall("Ruby", "uses", "symbol", "modifiers")
Go prefixes the type of the formal parameter with three dots, but otherwise it works the same way as Ruby:
func PrintAll(args ...string) {
  for _, arg := range args {
    fmt.Printf("%s\n", arg)
  }
}
PrintAll("Go", "uses", "symbol", "modifiers")
Other languages (notably Lisp dialects) introduce a sigil, or symbol, into the function signatures, and the variable after the symbol gets bound to the "overflow" parameters.
Clojure does it with &:
(defn print-all [& args]
  (doseq [a args]
    (println a)))

(print-all :clojure :uses :symbols)
Common Lisp and Emacs Lisp use &rest:
(defun print-all (&rest args)
  (dolist (arg args)
    (print arg)))

(print-all 'lisps 'use 'symbols)
Scheme uses the . operator, which mirrors the printed form of an improper list (a cons with a non-cons cdr):
(define (print-all . args)
  (for-each
    (lambda (x) (display x) (newline))
    args))

(print-all 'lisps 'use 'symbols)
Interesting side note: Scheme and other Lisp dialects differ on how they define functions. Scheme mimics the calling form in the definition, i.e.:
(define (function arg1 arg2 etc) ...)
whereas other Lisps split the argument list from the function name:
(defun function (arg1 arg2 etc) ...)
This might be the main reason Scheme can use . and other Lisps use & or &rest.
After looking at how everyone else does it, and talking it over with some nerd friends of mine who also like dreaming up new languages, I've settled on a novel approach:
(fn (print-all (msgs))
  (for (m msgs)
    (printf "%s\n" m)))
The presence of a single, single-element list in the call signature of the definition indicates to the compiler that the rest of the variadic arguments go in the msgs parameter.
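Calling it looks like any other Rook call (vaporware, remember):

(print-all "Rook" "uses" "single-element" "lists")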
There are limits to the notation:
My favorite thing about this notation is that the intent is clear, without the addition of any new lexer syntax, or any new keywords.
Happy Hacking!
A few gigs ago, I had a boss who was trying to single-handedly move the organization in which he found himself into the future of agile and dev ops. This was 2012, and agile wasn't a thing enterprises did, let alone mid-sized adware companies.
Garrett (my boss) liked to talk about T-shaped people. "We need more T-shaped people in this company." We were always trying to hire T's, instead of I's.
I left that job -- not because of Garrett; he left a year before I did -- but I still think a lot about his optimistic, borderline quixotic quest, and his T-shaped people.
Last week, someone asked me what she should do to reach my level of expertise (read: comfort) with technology. I get asked this often, and I don't have a really good answer -- primarily because I don't think there is any single, specific thing that I did to get where I'm at. I think I just lucked into being a T-shaped person.
In my daily work, I bring to bear all manner of weird and unusual disciplines and experiences. I once spent a summer compiling Linux kernels just to see if I could do it. I tried to write a Zork-like adventure game using an Apple ][e and learned first-hand why functions and recursion are a good thing. Most of my troubleshooting experience, summed up as Check Your Assumptions, derives from a decade of making bone-headed assumptions and wishing desperately to avoid the embarrassment again.
I'm a T-shaped person, and if you want to excel in whatever you do, I think you need to be a T-shaped person too.
Lisp fascinates me. I think it's the axiomatic, constructive nature of the thing. From a handful of operators and special forms spring ten thousand functions and countless libraries. The only other language that comes close to the simple-complexity of Lisp is C.
I think every Lisper tries their hand at implementing Lisp, via a meta-circular evaluator. That is, use a Lisp to implement a Lisp interpreter. Some even go so far as to implement Lisp on top of another language runtime, like Python or LLVM.
I've been programming for over two decades now, in about a dozen different languages. I've seen what I like in these languages, and I remember what I dislike. So the time has come (... melodramatic pause...) to build a better mousetrap language.
It's going to be Lisp, and it's going to be bootstrapped in C.
I call it Rook.
It has all the hallmarks of a "classic" Lisp:
It also has lots of things that I think are important, that don't seem to make it into language specifications:
Future writings will cover these topics in more detail. For now, I want to show you my plans for the language, in 100% vaporware code snippets!
You just can't implement a language without this snippet:
(import io)
(fn (main)
  (io.printf "Hello, World!\n"))
io is a standard library for doing input / output. The io.printf function derives its name from the import.
Standalone binaries need an entrypoint, and following the traditions of C, we call that entrypoint main.
(fn (fib n)
"Calculate the n'th number in the Fibonacci sequence"
(when
    (< n 1) (bail "invalid!")
(eq? n 1) 1
(eq? n 2) 1
#t (+ (fib (- n 1))
(fib (- n 2)))))
This is a naïve recursive implementation that finds Fibonacci numbers, but you already knew that because of the helpful documentation string. The compiler will remove that, by the way, since it has no effect on either the computation or the outside world.
The (when ...) construct is just a multi-branch if ... then ... else if ... as you might find in other languages. Conditionals are evaluated in order until a true value is found, and then the paired consequent clause is evaluated.
First-class functions are super useful.
(fn (evens lst)
(filter lst
(lambda (x) (even? x))))
Here, we pass a lambda to the (filter lst f) call, which will apply f to each item in lst, and return the subset for which f returned #t.
(fn (foo)
(with (x 3
y 4)
(+ x y)))
The (with ...) form introduces new variable bindings, shadowing any lexically "outer" bindings for the duration of the with form.
(fn (main)
(let (ch (chan))
(thread
(for n ch
(-> ch (* 2 n))))
(for n (list 1 2 3 4)
(-> ch n)
(printf "%d x 2 = %d\n" n (<- ch)))))
This (rather contrived) example demonstrates some co-processing capabilities, which is all based on passing messages via channels. (chan) creates a new channel, which we store in the ch variable.
Then, the (thread ...) form steps in and starts a new thread, executing the contained statements, which is just a for loop over the channel. The (-> ch x) form sends the value of x to the channel ch. On the flip-side, (<- ch) returns the next value available in the channel.
The main thread then loops over the list (1 2 3 4), sending each number off to the thread for processing, and printing the results.
Some people don't like S-expressions. That's fine. If you're willing to sacrifice macros (which more or less require S-exprs), there's no reason you can't still use Rook!
This S-expr program:
(fn (main)
(printf "Hello, World!\n"))
Is equivalent to this alt-syntax program:
fn main() {
printf("Hello, World!\n");
}
All Rook needs is an alternate lexer/parser that can turn the latter into the former. The compiler, of course, never sees the alternate syntax, which means a program can use libraries written in either style! Win!
What good is a language without standard libraries?
(import io)
(io.printf "hello, world!")
Imports the standard input/output library. Functions will be prefixed with io., like io.printf.
(import net/http)
(http.connect "https://jameshunt.us")
Here, we've imported a multi-label library, net/http. The prefix will be taken from the final component of the directory path.
(import (web net/http))
(web.connect "https://jameshunt.us")
If you want, you can provide your own namespace, by using the two-element list form of the (import ...) call.
(import (. io))
(printf "hello, world!\n")
The special prefix . tells Rook to import symbols directly into the main namespace, so you don't even need a prefix!
You can even combine these imports into one call:
(import
io
(web net/http)
(. string/utils))
What good is a language without user-defined libraries?
(lib base64)
(export fn (decode s)
(implementation-wanted!))
(export fn (encode s)
(implementation-wanted!))
The (lib ...) form declares that the definitions which follow belong to a library. The export decorator on the (fn ...) form defines a function that will be available after an import.
This is all just dreams at this point, but I'll be working in earnest on an implementation of Rook. Stay tuned!
X.509 is everywhere. You may not realize it, but these words were sent to your screen under the privacy of X.509 PKI, as part of the TLS protocol that puts that pretty little green lock up in the URL bar.
When I first encountered X.509 Certificates, they were big, scary, expensive things that only e-commerce sites, banks, and paranoid people wanted. They were so expensive, and so complicated, in fact, that architectures often centralized their usage, so you may have x.jameshunt.us, y.jameshunt.us, and z.jameshunt.us for the regular traffic, but the secure traffic was forced through secure.jameshunt.us (because that's what you paid to get in the certificate!)
Nowadays, with things like Let's Encrypt, you can pretty much get a certificate for free, with only minimal effort expended to prove you are who the certificate says you are.
But this isn't a post about Let's Encrypt.
This is a post about how to handle all those pesky systems that don't natively implement ACME. Systems that need / want X.509 certificates, but expect you to do all the legwork in getting one.
Systems like SHIELD, Cloud Foundry, and Vault (to name a few).
Here's the thing: generating X.509 certificates is hard. Google around for a bit and you'll find a hundred different ways to invoke openssl, different ways of setting up a transient CA directory, and different ways of generating the signing key (some of which even use 1024 bits, ferchrissakes!) Most of the time, these processes pop out a self-signed certificate that's good for a year. If you're lucky, the certificate has Subject Alt Names (SANs).
Judicious use of self-signed certificates leads to "browser exception fatigue" (a real medical malady, I assure you). Try as you might to only temporarily set an exception just for this one session each and every time, one day you're going to forget and not uncheck the bad little checkbox.
Ideally, you run your own Certificate Authority.
Yeah, you heard me. You can run your own Certificate Authority. It's not that hard, and if you bear with me here, I'll show you exactly what to do to keep it as safe and secure as I know how.
Contrary to what you may have heard, you can run your own Certificate Authority. When you get right down to it, a CA is nothing but a key and a promise. The key is used to sign certificates that vouch for identities, and the promise is "I will vet the people and systems I sign certificates for."
The one thing you won't easily be able to do is get your CA certificate loaded into every browser and operating system in marginal use today. That's where Verisign, GeoTrust, DigiCert, Thawte, GoDaddy, Network Solutions, et al have you beat -- they are already trusted by the browsers and the OS you're using.
But the difference between logging security exceptions for self-signed certificates (that you generate) and running your own certificate authority is twofold. In the best case, you can just install your CA certificate in the browser's trusted CA store, or teach your OS how to trust new CAs. In the worst case, you're doing the exact same amount of work to ignore security warnings!
There is one downside to being a CA: the security concerns get real, and fast. With one-off self-signed certificates, a leaked private key would allow an attacker to impersonate that system. But if you lose control of the CA signing key, an attacker can issue new certificates for things like totes.mybank.com and you may not notice (yay green padlock!)
The key here is encryption. And yes, I did think about editing out that accidental pun. But I didn't. You're welcome.
Encryption, encryption, encryption.
What we need is a system that can keep our CA signing key secret, and only bring it out when we need to use it. Luckily, there's a freely available, rock-solid, open source solution that works on every major operating system I'd want to run on: Vault.
Vault provides a secure credentials storage system. Coupled with safe, an operator-friendly command-line tool I wrote, we have the makings of a super-easy-yet-secure setup.
First, you're going to need a place to run the Vault. You can go totally secure and air-gap a laptop or old desktop workstation, if you want. Or, you can just spin the Vault process up when you need it and leave it offline otherwise. I prefer the latter, and I usually do it on a Linux box.
My daily workhorse is a maxed-out Macbook Pro, so why do I prefer Linux? Because it has a neat little system call named mlock(2).
From the man page:
lock part or all of the calling process's virtual address space
into RAM, preventing that memory from being paged to the swap
area.
This is important, because Vault is a long-running process, and it will be handling secrets directly. With mlock(2), we can rest assured that the memory management unit (a critical part of any modern OS) isn't going to barge in and copy sensitive memory to an unencrypted swap file somewhere.
(Note: Vault does have an option to turn off its mlock behaviors, but they don't recommend doing so and neither do I.)
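If you want to see what that protection looks like from C, here's a tiny illustrative sketch (mine, not Vault's source) of locking a secret-holding buffer into RAM:

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
    static char secret[4096];

    /* pin this memory in RAM; it can never be written out to swap */
    if (mlock(secret, sizeof(secret)) != 0) {
        perror("mlock");  /* fails without CAP_IPC_LOCK or a big RLIMIT_MEMLOCK */
        return 1;
    }

    /* ... load and use the secret ... */

    memset(secret, 0, sizeof(secret));  /* scrub before unlocking */
    munlock(secret, sizeof(secret));
    return 0;
}

Without the right privileges, that mlock call will fail, which is why we'll be reaching for setcap in a moment.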
So, Linux it is.
First, we need a place to store the encrypted data while the Vault is powered off.
$ mkdir -p ~/.cavault/data
$ sudo chown -R jrhunt:root ~/.cavault
You'll want to go download the Linux Vault binary (which comes packaged inside of a zipfile). I recommend putting it somewhere in your $PATH, like ~/bin:
$ curl -Lo ~/bin/vault.zip \
      https://releases.hashicorp.com/vault/0.9.0/vault_0.9.0_linux_amd64.zip
$ (cd ~/bin && unzip vault.zip && rm vault.zip)
In order to run vault as a non-root user, while still maintaining the ability to mlock(2) the sensitive bits of memory, we can use setcap on the vault binary:
$ sudo setcap cap_ipc_lock=+ep ~/bin/vault
Configuring Vault is pretty straightforward. I'm not going to enable too much functionality — Vault can do a lot more than what we're going to use it for. Here's a good starting configuration:
$ cat >~/.cavault/conf <<EOF
listener "tcp" {
address = "127.0.0.1:8200"
tls_cert_file = "$HOME/.cavault/tls.pem"
tls_key_file = "$HOME/.cavault/tls.key"
tls_min_version = "tls12"
}
storage "file" {
path = "$HOME/.cavault/data"
}
EOF
With this configuration, Vault will listen for requests on https://127.0.0.1:8200, over TLS 1.2, and will store the encrypted, at-rest credential data in our ~/.cavault/data directory.
You may notice that I snuck another X.509 certificate and key in there. It's turtles, all the way down. For my specific setup, I reused a wildcard certificate that was valid for another domain. If you want (and if you're only running over loopback) you can specify tls_disable = 1 and run over HTTP until you can generate a certificate using safe + vault. So meta.
Next, let's write a small script to run the Vault, since remembering command-line arguments and paths is annoying.
$ cat >~/bin/ca <<EOF
#!/bin/sh
exec vault server -config ~/.cavault/conf
EOF
$ chmod 755 ~/bin/ca
All that's left to do now is to fire it up.
$ ca
==> Vault server configuration:
Cgo: disabled
Listener 1: tcp (addr: "127.0.0.1:8200", cluster address: "127.0.0.1:8201", tls: "enabled")
Log Level: info
Mlock: supported: true, enabled: true
Storage: file
Version: Vault v0.9.0
Version Sha: bdac1854478538052ba5b7ec9a9ec688d35a3335
When you start Vault up the first time, it will be in an uninitialized state. You won't be able to read secrets from it or put secrets into it. To initialize the Vault, let's use safe:
$ safe target ca https://127.0.0.1:8200
Now targeting ca at https://127.0.0.1:8200
$ safe vault status
Error checking seal status: Error making API request.
URL: GET https://127.0.0.1:8200/v1/sys/seal-status
Code: 400. Errors:
* server is not yet initialized
!! exit status 1
$ safe vault init
Unseal Key 1: wXTO7wdmF0TVfqivzw6OaJiVeC+QNacAanTsxQ1RlKad
Unseal Key 2: QQugZ0RmflcMzHTLQCt1StP3XXGiCMWOabhWkXvuDMUS
Unseal Key 3: 1jzeotmjwLahW7s8zhvJTs/npVgBVwjS5cAYgRjBJf9x
Unseal Key 4: cN+wF2qw3xliLwxzU6t+/DnYtBoW2FC1RYom/NEI4XNj
Unseal Key 5: E0Q/IRJr9RiWNW3kR6uvSSo1ehZ8AOO0+jvugQQgy1Q4
Initial Root Token: b2318812-b5f2-0966-3a1b-ea964bc1512d
Vault initialized with 5 keys and a key threshold of 3. Please
securely distribute the above keys. When the vault is re-sealed,
restarted, or stopped, you must provide at least 3 of these keys
to unseal it again.
Vault does not store the master key. Without at least 3 keys,
your vault will remain permanently sealed.
Your unseal keys and initial root token will be different.
Now the Vault is initialized, but it is still sealed. You'll need to run safe vault unseal three times (turning widdershins optional), supplying a different unseal key each time.
$ safe vault unseal
Key (will be hidden):
$ safe vault unseal
Key (will be hidden):
$ safe vault unseal
Key (will be hidden):
Finally, we can authenticate with our root token:
$ safe auth token
Authenticating against ca at https://127.0.0.1:8200
Token:
The first thing I do on any Vault I administer is to set a "canary" value:
$ safe set secret/handshake knock=knock
knock: knock
$ safe tree
.
└── secret
└── handshake
safe tree is just one of the handy little commands that safe provides. The one we'll be most interested in for this project is safe x509 — there's a whole subsystem for generating CAs, issuing certificates, managing CRLs, etc!
First things first, we need an actual certificate authority certificate, one that is capable of signing other certs.
$ safe x509 issue --ca \
      --name ca.jameshunt.us \
      --bits 4096 \
      --subject /cn=ca.jameshunt.us \
      --ttl 10y \
      secret/jameshunt.us/ca
The first flag, --ca, tells safe we want a CA cert.
The --name flag sets a subjectAltName for the certificate, identifying who the CA is. What you choose is up to you (and depends on what DNS domains you own), but using the node name ca is a good practice to get into.
The --bits flag is important — it sets the RSA private key strength, in bits. Valid values are 1024 (highly discouraged, if not outright broken), 2048 (okay for leaf certs) and 4096 (super-strong). I'll take the strongest my CPUs can handle, thank you very much!
The --subject flag identifies the CA by providing a full identity. Sometimes you will see this as something like
/cn=ca.foo/c=US/st=New York/l=Buffalo/o=Hunt Productions, Inc./ou=R&D
Which of these relative distinguished names you choose to specify is entirely up to you. Check out RFC 5280 for details.
The --ttl flag determines how long the CA certificate is good for. I'd like a decade or so before I have to rotate out my certs, but you are welcome to choose a shorter time-to-live.
The last argument to safe x509 issue is the path, inside the Vault, where safe should store all the bits and pieces of the CA, including its private key and the public PEM-encoded certificate.
If all goes well, your laptop will crank on the RSA key generation phase for a bit and then you'll be back at your command prompt.
$ safe tree
.
└── secret
├── handshake
└── jameshunt.us/
└── ca
You can use safe read to see what was generated:
$ safe read secret/jameshunt.us/ca
--- # secret/jameshunt.us/ca
certificate: |
-----BEGIN CERTIFICATE-----
MIIE4TCCAsmgAwIBAgIBAjANBgkqhkiG9w0BAQ0FADAaMRgwFgYDVQQDEw9jYS5q
....................... etc. ..................................
ge1b5Xm7GrekaL2VqW/hTLXxnSRk9RzfZl8M421ueRmVlRun8P7J8IkKynx22uNA
Jt/9L4w=
-----END CERTIFICATE-----
combined: |
-----BEGIN CERTIFICATE-----
MIIE4TCCAsmgAwIBAgIBAjANBgkqhkiG9w0BAQ0FADAaMRgwFgYDVQQDEw9jYS5q
....................... etc. ..................................
ge1b5Xm7GrekaL2VqW/hTLXxnSRk9RzfZl8M421ueRmVlRun8P7J8IkKynx22uNA
Jt/9L4w=
-----END CERTIFICATE-----
-----BEGIN RSA PRIVATE KEY-----
tB3AlPKOd0onIcb1pomGZoFaGJQer3Pj8+hlP6ysHF9csAjleMEPmRFFcuDxoOKQ
....................... etc. ..................................
aKjrjAFAl+waomosq6IQtZqFy2ys2z75Lpbas2nHiKQKpIH3CccYjQ==
-----END RSA PRIVATE KEY-----
crl: |
-----BEGIN X509 CRL-----
MIICZDBOAgEBMA0GCSqGSIb3DQEBCwUAMBoxGDAWBgNVBAMTD2NhLmphbWVzaHVu
....................... etc. ..................................
fouAOTFAl+waomosq6IQtZqFy2ys2z75Lpbas2nHiKQKpIH3CccYjQ==
-----END X509 CRL-----
key: |
-----BEGIN RSA PRIVATE KEY-----
tB3AlPKOd0onIcb1pomGZoFaGJQer3Pj8+hlP6ysHF9csAjleMEPmRFFcuDxoOKQ
....................... etc. ..................................
aKjrjAFAl+waomosq6IQtZqFy2ys2z75Lpbas2nHiKQKpIH3CccYjQ==
-----END RSA PRIVATE KEY-----
serial: "4"
As you can see from the output, the following keys are stored: certificate (the public CA certificate), combined (the certificate and private key together, for software that wants them in one file), crl (the certificate revocation list, which safe x509 revoke manages), key (the private signing key), and serial (the serial number counter for issued certificates).
So let's pull that CA certificate out and put it on-disk:
$ safe read secret/jameshunt.us/ca:certificate > ca.pem
Easy. You can now import that certificate into your browser or your OS. That's too broad a topic to cover here; consult your browser or operating system documentation for the particulars.
Congratulations, you have a working certificate authority.
safe x509 issue can also issue certificates as your CA, signing them with the CA signing key, securely:
$ safe x509 issue \
      --signed-by secret/jameshunt.us/ca \
      --name foo.jameshunt.us \
      --name *.bar.jameshunt.us \
      --name 10.6.7.8 \
      --ttl 90d \
      secret/jameshunt.us/a-new-cert
The chief difference here is that we specified --signed-by as the path to our CA in the Vault. That's all safe x509 needs to know in order to properly sign your new certificate.
Another new thing is that we specified --name multiple times. You can have as many subject alternative names as you need (within reason), and they can be regular names like foo.jameshunt.us, wildcards like *.bar.jameshunt.us, IP addresses like 10.6.7.8, and even email addresses (not shown).
If you want to know the particulars for a certificate in your CA vault, you can use safe x509 show:
$ safe x509 show secret/jameshunt.us/a-new-cert
secret/jameshunt.us/a-new-cert:
cn=foo.jameshunt.us
issued by: cn=ca.jameshunt.us
expires in 89 days
valid from Nov 17 2017 - Feb 15 2018 (~90 days)
for the following names:
- foo.jameshunt.us (DNS)
- *.bar.jameshunt.us (DNS)
- 10.6.7.8 (IP)
As you can see, this certificate is only good for 90 days, has all the correct SANs, and was issued by our new CA.
If you need to configure nginx to use this cert, you can grab the combined secret:
$ safe read secret/jameshunt.us/a-new-cert:combined > nginx.pem
Otherwise, the key and certificate are available separately in the key and certificate attributes, respectively.
I strongly encourage you to take a look at safe help x509 and the help pages for each sub-command (safe help x509 issue and friends). safe has a great many other tricks up its sleeve, including the ability to generate new SSH and RSA keypairs, random password generation, crypt support and reformatting, and more.
When you're not using the Vault, you can shut it down and let the data reside safely and securely on-disk, encrypted. When you need it, fire it up, unseal it, and go to town.
Happy Hacking!
UPDATE (Nov 20, 2017): You should really put the vault binary in your $PATH so that safe can see it. I updated the instructions, code, and configuration accordingly.
B-trees are the fascinating data structures at the heart of almost every piece of data storage and retrieval technology you've ever dealt with. Ever created a relational database table with an indexed column? B-trees. Ever accessed a file in a filesystem? B-trees.
They are so useful because they keep their keys in sorted order, stay balanced as keys are inserted, and keep lookups cheap by minimizing disk accesses.
To illustrate how B-trees work, I'm going to use pictures. Later we can get into the code. To make the pictures intelligible, we're going to have to use a smaller degree; each node in our B-tree will only be able to hold 5 keys and 6 links.
We're also going to skip some of the boring parts and start out at "step 1" with 4 elements already inserted into our B-tree:
The blue boxes are the slots in the B-tree node, each containing a key. There are pointers on each side of each slot. For leaf nodes (and our root node is a leaf node at this point), those pointers point to the actual data being stored for each key, i.e., A -> a, B -> b, etc.
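We'll keep things pictorial for now, but if you prefer to see structure as code, here's one plausible C rendering of the toy node (my own sketch; the field names are mine, not from any particular implementation):

#define KEYS  5
#define LINKS (KEYS + 1)

struct btnode {
    int   nkeys;          /* key slots currently in use */
    int   leaf;           /* 1: links point at data; 0: at child nodes */
    char  keys[KEYS];     /* 'A', 'B', ... kept in sorted order */
    void *links[LINKS];   /* one pointer on each side of each slot */
};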
Now, let's insert a new key/value pair, E -> e.
Both F and f get moved to make room for E -> e in the correct (sorted) position.
This is the killer feature of B-trees; as elements are inserted, keys remain in sort order. This is why databases and filesystems make such extensive use of them.
We could just stop here, and wait for the next insertion, but we're going to do some proactive housekeeping first. Our B-tree node is now full. We need to grow the tree to give ourselves some breathing room.
We'll do this by splitting our node in two, right down the middle, at D, and migrating D "up" to become the new root. To the left we will place keys A and B (which are strictly less than D), and to the right, E and F.
The choice of D is important. We're hedging our bets by picking the median, because that gives equal footing when key/value pairs are inserted in random order.
Our B-tree now spans three nodes, and looks like this:
Note, particularly, that it is the pointer after B that points to d. You can (and should!) think of this as the pointer between B and D — viewed that way, it's no different from the structure we had in steps 1 and 2.
Now we are starting to see the tree structure emerge.
What happens if we insert C -> c?
Insertion starts by considering the root node, which just contains D. Since C < D, and the root node is not a leaf node, we follow the pointer link to the left of D and arrive at the leftmost child node [A, B]. This is a leaf node, so we insert C -> c right after B, remembering to shift the link to d one slot to the right (D > C, after all). This does not fill the child node, so we are done.
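That walk is the heart of every B-tree operation. For the curious, here's a lookup sketch in C against the toy struct from earlier (again, my own code, not a definitive implementation; note how the "pointer between B and D" convention parks a separator's datum after the last key of the leaf to its left):

void *lookup(struct btnode *node, char key)
{
    int matched = 0;  /* did we pass an interior slot equal to key? */

    while (!node->leaf) {
        int i = 0;
        while (i < node->nkeys && key > node->keys[i])
            i++;                 /* first slot with keys[i] >= key */
        if (i < node->nkeys && node->keys[i] == key)
            matched = 1;         /* datum lives at the end of the left subtree */
        node = node->links[i];   /* ties descend left */
    }

    /* at a leaf: either the key is in one of our slots... */
    for (int i = 0; i < node->nkeys; i++)
        if (node->keys[i] == key)
            return node->links[i];

    /* ...or it was an ancestor's separator, parked after our last key */
    return matched ? node->links[node->nkeys] : 0;
}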
Above, we split the node as soon as it was full.
We didn't have to. We could just as easily have waited until the next insertion, and split the node then, to make room for our new key/value. That would lead to more compact trees, at the expense of a more complicated implementation.
To understand why, let's consider a pathological case where we have several full B-tree nodes along a single ancestry line:
Then, let's try to insert the key/value pair C -> c. To do so, we traverse the root node to the H-node, to the A-node. Since the A-node is full, we need to split it at the median, D:
Uh-oh. We have nowhere to put D. Normally, it would migrate up the tree and be inserted into the H-node, but the H-node is full!
We could split the H-node, but it suffers the same problem; its parent (the N-node) is also full! Our simple insertion has cascaded into a sequence of partitioning operations:
Being proactive, and splitting on the way down (as the literature puts it), means that we can never get into this problematic scenario in the first place.
In Algorithms (1983), Sedgewick has this to say:
As with top-down 2-3-4 trees, it is necessary to "split" nodes
that are "full" on the way down the tree; any time we see a
k-node attached to an M node, we replace it by a (k + 1)-node
attached to two M/2 nodes. This guarantees that when the bottom
is reached there is room to insert the new node.
Aside: Algorithms is a wonderful text, worthy of study. Don't let anyone tell you that seminal computer science texts from the 80's (or before!) are obsolete.
By taking this advice one step further, and splitting on insertion, we can ensure that traversal / retrieval doesn't need to be bothered with such trifles.
I spent a fair amount of time studying B-trees for a project of mine, focusing mainly on how best to size the B-tree node for optimal density and tree height.
Tree height is important because our data values are only stored in the leaf nodes. Therefore, every single lookup has to traverse the height of the tree, period. The shorter the tree, the fewer nodes we have to examine to find what we want.
Given that most B-trees live on-disk, with only the root node kept in memory, tree height becomes even more important; fewer disk accesses = faster lookup.
The following equations consider B-trees of degree \(m\) (for maximum number of children allowed in a node).
First, we need to figure out the minimum number of keys required of interior nodes, which we'll call \(d\). Usually, this is:
$$d = \lceil m/2 \rceil$$
Using a median-split, we should never have fewer than \(d\) keys per interior node. The root node is special, since we have to start somewhere, and the leaf nodes are likewise exempt.
The worst-case height then, given \(n\) items inserted, is:
$$h_{worst} \le \lfloor \log_d((n + 1)/2) \rfloor$$
and the best-case height is:
$$h_{best} = \lceil \log_m(n + 1) \rceil - 1$$
Common wisdom has it that you should choose a natural block size as your node size, in octets. That's either 4096 bytes (4k) or 8192 bytes (8k), so let's run through the calculations.
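(Where do the m values below come from? Assuming, for instance, 8-octet child pointers, 4-octet keys, and a few octets of per-node header, a node of \(T\) octets can hold roughly

$$m \cdot 8 + (m - 1) \cdot 4 \le T \quad \Rightarrow \quad m \approx T/12$$

which is where figures like 340 and 681 come from; different key and pointer sizes will shift them around.)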
For \(T = 4096\):
$$m = 340$$
$$d = \lceil m/2 \rceil = \lceil 340/2 \rceil = 170$$
At minimum, then, each interior node holds 170 pointers and 169 keys. Our worst-case height, given minimal node usage, is:
$$h_{worst} \le \lfloor \log_d((n + 1)/2) \rfloor$$
For my analysis, I was trying to find best- and worst-case scenarios given \(2^{32} - 1\) items (one for every second I could fit in a 32-bit unsigned integer). \(n + 1\) then simplifies to just \(2^{32}\), and dividing by two knocks one off of the exponent, leaving \((n + 1)/2 = 2^{31}\):
$$h_{worst} \le \lfloor \log_{170}(2^{31}) \rfloor = \lfloor 4.18 \rfloor = 4$$
So with minimally dense nodes, we can expect to traverse at most 4 nodes to find the value corresponding to any given key. Let's see about the best-case scenario. As before, \(n + 1\) simplifies down to \(2^{32}\):
$$h_{best} = \lceil \log_m(n + 1) \rceil - 1$$
$$h_{best} = \lceil \log_{340}(2^{32}) \rceil - 1 = \lceil 3.805 \rceil - 1 = 3$$
So, a best case of 3 against a worst case of 4. That's what I call balance.
For \(T = 8192\):
$$m = 681$$
$$d = \lceil m/2 \rceil = \lceil 681/2 \rceil = \lceil 340.5 \rceil = 341$$
Using \((n + 1)/2 = 2^{31}\), as before, we can calculate the worst-case for an 8k-page B-tree:
$$h_{worst} \le \lfloor \log_d((n + 1)/2) \rfloor$$
$$h_{worst} \le \lfloor \log_{341}(2^{31}) \rfloor = \lfloor 3.684 \rfloor = 3$$
That's one better than the 4k worst case: at most 3 nodes to look at before we find the value we seek. Not bad.
What about \(h_{best}\)?
$$h_{best} = \lceil \log_m(n + 1) \rceil - 1$$
$$h_{best} = \lceil \log_{681}(2^{32}) \rceil - 1 = \lceil 3.4 \rceil - 1 = 3$$
It would seem that the choice between 4k and 8k pages barely moves our heights: the best case is 3 either way, and only the worst case improves, from 4 to 3, with the larger page.
If you're interested in the genesis of B-trees, you'll want to read Bayer and McCreight's 1970 paper Organization and Maintenance of Large Ordered Indices.
Douglas Comer's 1979 review / survey of B-trees, The Ubiquitous B-Tree, covers the basics of B-trees, explains their ascent to ubiquity, and discusses several major variations, like B+-trees, and their relative merits.
Happy Hacking!
My friend and colleague Dennis Bell hit me up the other day with an odd problem:
hey, I have a problem
apparently perl allows you to create hard links to directories on OSX, but
I have no idea how to get rid of them...
Here's the stat output he sent me:
$ stat -f "%d/%i %Sp %N" copy ../src/orig
16777220/30433678 drwxr-xr-x copy
16777220/30433678 drwxr-xr-x ../src/orig
Sure enough, two directories, same root device (16777220), and the exact same inode.
This is a huge no-no in filesystem land, if only because preventing hard links eliminates an entire class of unsolvable problems with disk traversal. The tl;dr is that without hard directory links, you can't get cycles in the directed graph of parent -> child relationships. Cycles make graph traversal dangerous, because of the danger of infinite loops.
(Note: symbolic links sidestep this issue by letting the traversal logic know who is the real directory and who is the link; the links can then be skipped.)
In fact, it's so taboo, that the Linux kernel flat out refuses to let you hard link one directory to another. This is what fascinated me so much about my friend's problem - how in blazes was this even possible??
Since I have spent a fair amount of time poking about in the Linux kernel source tree, and because I have years of experience writing system code to run on top of Linux, I figured I'd start there.
(This is based on commit 7eb97ba from Linus' tree.)
→ grep -rn SYSCALL_DEFINE * | grep '\blinkat'
fs/namei.c:4239:SYSCALL_DEFINE5(linkat, int, olddfd, const char __user *, oldname,
So linkat(2) is implemented in fs/namei.c, but since it is a long function (by blog standards), I'll not repost the whole thing here. You are more than welcome to read the full listing here.
At some point during execution, once it has figured out just what you're trying to link up, linkat(2) calls vfs_link(...), which checks to see if the target of the link is a directory (line 4200):
if (S_ISDIR(inode->i_mode))
return -EPERM;
Linux, unequivocally, forbids the creation of hard links to directories. It's hard-coded into the kernel. It's not configurable. It's not filesystem-dependent. It's the way things are.
A quick git blame on the namei.c source file turned up a long history:
7e79eedb3b2 (Tetsuo Handa 2008-06-24 16:50:15 +0200 4199) if (S_ISDIR(inode->i_mode))
^1da177e4c3 (Linus Torvalds 2005-04-16 15:20:36 -0700 4200) return -EPERM;
7e79eedb was a variable-reuse patch to clean up the code slightly.
1da177e4 is the initial import of Linux kernel 2.6.12-rc2 into Git. Since I've already invested a sizable amount of time in this particular science expedition, I started looking at old tarball dists of Linux from kernel.org.
In Linux 2.5.5, Al Viro migrated the Big Kernel Lock from vfs_link() to the filesystem-specific i_op->link() handler. At the same time, he hoisted the S_ISDIR() check up into vfs_link(), effectively deciding for all filesystems that links to directories are verboten. Here's the Changelog entry:
<viro@math.psu.edu> (02/02/14 1.345)
[PATCH] (3/5) more BKL shifting
BKL shifted into ->link(), check for S_ISDIR moved into vfs_link().
Having verified my assumptions regarding directory hard links in Linux, it was time to try to reproduce the "issue" on macOS. Dennis was using Perl when he ran into this, but since this is firmly in kernel-system-call territory, I'm going to use C. Here's a small program I wrote that (thinly) wraps the link(2) system call:
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <errno.h>
int main(int argc, char **argv)
{
int rc;
if (argc != 3) {
fprintf(stderr, "USAGE: %s /old /new\n", argv[0]);
return 1;
}
rc = link(argv[1], argv[2]);
if (rc != 0) {
fprintf(stderr, "%s -> %s: %s (error %d)\n",
argv[2], argv[1], strerror(errno), errno);
} else {
fprintf(stderr, "%s -> %s: SUCCESS!\n", argv[2], argv[1]);
}
return rc;
}
And here's what happens when you run it:
$ mkdir dir1
$ ./lnk dir1 dir2
dir2 -> dir1: Operation not permitted (error 1)
On the surface, it would appear that macOS does not allow directory hard links. But Dennis assured me he had seen it, with his own eyes, so I tried again:
$ mkdir dir1
$ mkdir copy
$ ./lnk dir1 copy/dir2
copy/dir2 -> dir1: SUCCESS!
Now that is odd. I checked the man page for link(2), which states:
In order for the system call to succeed, path1 must exist and both path1
and path2 must be in the same file system. As mandated by POSIX.1, path1
may not be a directory.
A bald-faced lie, it would seem.
I have literally zero experience reading through Apple's darwin codebase, so rather than a code dive, I spoke with a few colleagues. One of them, whose Google-fu is stronger than mine, found this StackOverflow post, which hints that Apple implemented it in OS X 10.5 Leopard, for their Time Machine product, in 2007.
Another SO question (referenced by the first) sheds a little light on the ground rules for this feature:
Snow Leopard can create hard links to directories as long as you follow
Amit Singh's six rules:
1. The file system must be journaled HFS+.
2. The parent directories of the source and destination must be different.
3. The source’s parent must not be the root directory.
4. The destination must not be in the root directory.
5. The destination must not be a descendent of the source.
6. The destination must not have any ancestor that’s a directory hard link.
(Note: I believe the quote is referring to the author of Mac OS X Internals: A Systems Approach, Amit Singh)
So OS X (now macOS) allows you to hard link directories under specific circumstances that are guaranteed to not cause cycles in the filesystem graph. Neat. Unfortunately, standard CLI utilities (BSD or GNU) seem to be caught a bit off-guard by this newfound power.
Consider GNU coreutils, which I brew install on every Mac I've ever owned:
$ which rm
/usr/local/opt/coreutils/libexec/gnubin/rm
$ rm copy/dir2
rm: cannot remove 'copy/dir2': Is a directory
$ which rmdir
/usr/local/opt/coreutils/libexec/gnubin/rmdir
$ rmdir copy/dir2
rmdir: failed to remove 'copy/dir2': Directory not empty
BSD utils has the same issue:
$ /bin/rm copy/dir2
rm: copy/dir2: is a directory
$ /bin/rmdir copy/dir2
rmdir: copy/dir2: Directory not empty
My initial advice to Dennis was to use unlink(1), since he was trying to undo the hard link, and that's precisely what unlink is for. In fact, once I had reproduced the issue on my laptop, I was able to fix the problem by unlinking the duplicate inode. When he tried it, it said:
$ unlink copy/dir2
unlink: copy/dir2: is a directory
As it turns out, stock unlink is just a wrapper around /bin/rm that doesn't take any options:
$ /bin/unlink
usage: rm [-f | -i] [-dPRrvW] file ...
unlink file
The GNU coreutils version of unlink doesn't have this problem, apparently.
Ultimately, Dennis took the nuclear option and rm -rf'd his way back to a sane filesystem. By removing everything under one of the directory instances, he was able to rmdir the other one and start over.
Diff'rent strokes, I suppose.
I found this ordeal fascinating; I hope you did too. A few things I learned: Linux flat-out forbids hard links to directories, deep in the VFS layer; macOS quietly permits them (within Amit Singh's six rules) to support Time Machine; and the standard CLI utilities still haven't quite caught up with that power.
Happy Hacking!
I've been writing some internal APIs to abstract away the subtleties of managing multiple network peers behind socket file descriptors. The ideal this mini-library aims for is that client programs can connect up to lots of network endpoints, and then publish lots of little binary messages to all endpoints simultaneously without peering into the eldritch horror that is the Real World of Networked Computers.
The guts of the implementation center around a struct socket which the client program drives via library functions — connect to this host and that host, send a message to both, wait for a message from either. Each connection has its own socket file descriptor, and it's the job of my library thing to multiplex them properly.
Easy. epoll(7) to the rescue!
It looks something like this (guaranteed to not actually compile). First, the data structures:
struct peer {
    int fd;              /* socket file descriptor */
    void *tx;            /* some sort of transmission queue */
    struct peer *next;
};

struct socket {
    int epfd;            /* epoll(7) management descriptor */
    struct peer *peers;
};
The tx member of struct peer is a magic, hand-wavy queue. Don't fret too much about it; but trust me, it keeps an ordered list of the messages that still need to be written to this peer.
That's where sendit() comes in:
void sendit(struct socket *sock, void *msg) {
    struct peer *p;

    /* don't write(2) anything yet; just enqueue for tx */
    for (p = sock->peers; p; p = p->next) {
        p->tx = enqueue(p->tx, msg);
    }
}
This is a PUBLISH model of communication; all of the connected endpoints will receive a copy of each message we send. No round-robin. No LRU.
To do the heavy lifting of the actual send / receive, we turn to pollit:
void pollit(struct socket *sock) {
    int i, nfd;
    struct peer *p;
    struct epoll_event ev[MAX_EVENTS];

    nfd = epoll_wait(sock->epfd, ev, MAX_EVENTS, -1);
    for (i = 0; i < nfd; i++) {
        for (p = sock->peers; p; p = p->next) {
            if (ev[i].data.fd != p->fd) continue;
            if (ev[i].events & EPOLLIN)
                recvit(sock, p);   /* read whatever the peer sent us */
            if (ev[i].events & EPOLLOUT
             && p->tx)             /* my FATAL MISTAKE! */
                flushit(sock, p);  /* write(2) queued messages out */
        }
    }
}
My fatal mistake was in adding that && p->tx compound conditional to the EPOLLOUT check. I started seeing pollit() called multiple times a second, and not actually blocking as it waited for something worthwhile to do. A CPU busy-loop, and my battery was not thrilled. (Fans were pretty excited about it though.)
To understand why, I had to take a step back and think long and hard about what it means for a network socket to be readable or writable. At first blush, it seems straightforward - can I send a packet to the other side?
Except it's a little more complicated than that, isn't it?
It may surprise you to find out, but userspace applications literally have no way of knowing when the packets are sent. That all happens inside the kernel, and the kernel is very territorial. When you write(2) to a network socket, the "packet" is just sitting in a buffer somewhere inside the kernel, right next to the network card. It may get sent right away, it may wait for Nagle's Algorithm to run its course. It may wait for the receiving end to open up its TCP window.
When epoll_wait checks to see if the socket file descriptor is writable, it's really checking to see how much space is available in that kernel buffer. If the kernel has the room for anything (even just a single octet!), epoll will flag it as EPOLLOUT.
epoll doesn't care about my && p->tx.
To get the desired behavior, I ended up having to perform a delicate, if commonsensical dance with epoll — when a message is queued up, tell epoll to watch that peer's socket descriptor for writability. Once it is sent, and the queue is empty, tell epoll to knock it off.
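In code, that dance boils down to re-arming the peer's registration with epoll_ctl(2). A sketch against the structs above (watch_writes is my name for it; error handling elided):

#include <sys/epoll.h>

/* call with want=1 when a peer's tx queue becomes non-empty,
   and want=0 once it drains */
static void watch_writes(struct socket *sock, struct peer *p, int want)
{
    struct epoll_event ev;

    ev.data.fd = p->fd;
    ev.events  = EPOLLIN | (want ? EPOLLOUT : 0);

    /* modify the existing registration's interest set in-place */
    epoll_ctl(sock->epfd, EPOLL_CTL_MOD, p->fd, &ev);
}

sendit() flips writability on right after enqueueing; the EPOLLOUT branch of pollit() flips it back off once the last queued message has been written. epoll_wait() then blocks properly, and the fans spin down.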
Happy Hacking!
This morning, I was trying to spin up two Dell R710s in our lab environment, so that I could install Openstack. Thankfully, these things have iDRACs (remote out-of-band baseboard management controllers over IPMI), so I don't have to crouch down in the supply closet to run the installation.
I have to use a VirtualBox IE6 VM with the --ahem-- security turned way down low, but it beats a sharp stick in the eye.
I'm trying to set up two machines, openstack0 and openstack1, over their iDRACs. I've got PXE and TFTP all set, complete with all the operating system disks and the requisite configuration to do the installs unattended, but the PXE menu configuration refuses to select a boot target automatically. This is by design.
I've got my IE6 VM up and connected to both iDRACs, logged in, on the virtual consoles, and I've got both of the installations in-flight, when all of a sudden, the console stops responding on exactly one of the iDRACs.
Great. Maybe the management card is flaking. I check the networking in the closet to make sure someone didn't jostle something or need a network drop and scavenge one. Nope. All good there. Most of my co-workers are watching something on youtube in the conference room.
Time to start basic networking troubleshooting.
Can I ping both machines from my Macbook? Check.
Can I ping them from the IE6 VM? Check.
Let's try a port-scan. The iDRAC that won't connect is first on the block:
→ nmap 10.200.0.120
Starting Nmap 7.31 ( https://nmap.org ) at 2017-03-20 10:42 EDT
Nmap scan report for 10.200.0.120
Host is up (0.0086s latency).
Not shown: 994 closed ports
PORT STATE SERVICE
7676/tcp open imqbrokerd
8000/tcp open http-alt
8001/tcp open vcom-tunnel
8002/tcp open teradataordbms
8080/tcp open http-proxy
9999/tcp open abyss
Nmap done: 1 IP address (1 host up) scanned in 0.22 seconds
Well, it's got ports open and listening, but I don't see HTTP (80) or HTTPS (443), which is strange. What about the other (functioning) iDRAC?
→ nmap 10.200.0.121
Starting Nmap 7.31 ( https://nmap.org ) at 2017-03-20 10:42 EDT
Nmap scan report for 10.200.0.121
Host is up (0.012s latency).
Not shown: 995 closed ports
PORT STATE SERVICE
22/tcp open ssh
80/tcp open http
111/tcp open rpcbind
443/tcp open https
5900/tcp open vnc
Nmap done: 1 IP address (1 host up) scanned in 1.48 seconds
Also up, but ... with ... different ... ports.
These IPs (10.200.0.120 and .121) are both statically assigned in the iDRAC configuration (the one bit of in-front-of-the-hardware setup I suffered through).
On a lark, I checked the ARP table on my macbook (which is also on the 10.200.x.x segment):
→ ping -c1 10.200.0.120
PING 10.200.0.120 (10.200.0.120): 56 data bytes
64 bytes from 10.200.0.120: icmp_seq=0 ttl=64 time=2.287 ms
--- 10.200.0.120 ping statistics ---
1 packets transmitted, 1 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 2.287/2.287/2.287/0.000 ms
→ ping -c1 10.200.0.121
PING 10.200.0.121 (10.200.0.121): 56 data bytes
64 bytes from 10.200.0.121: icmp_seq=0 ttl=64 time=1.504 ms
--- 10.200.0.121 ping statistics ---
1 packets transmitted, 1 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 1.504/1.504/1.504/0.000 ms
→ arp -a | grep 10.200.0.12
? (10.200.0.120) at fc:f1:36:11:8a:1c on en0 ifscope [ethernet]
? (10.200.0.121) at a4:ba:db:15:2b:95 on en0 ifscope [ethernet]
(The two ping runs help ensure that the ARP entries haven't aged out of the ARP table.)
Now, I was expecting both MAC addresses to share the same OUI, but they are radically different. Running both of them through Wireshark's online OUI lookup tool reveals the problem:
A4:BA:DB Dell Inc.
FC:F1:36 Samsung Electronics Co.,Ltd
Remember how I told you that no one was mucking with the wiring in the closet because they were watching a video on the Internet in the big conference room? The reason they were in that particular room is because that room has a 75" Samsung flat panel TV.
Oh, a Samsung? Interesting...
Once they were done, I got on the Samsung and took a look at its networking configuration. It's configured for DHCP and, sure enough, the DHCP server had handed out my iDRAC IP address.
tl;dr: informing your DHCP server of the addresses you are going to statically assign (so it keeps them out of its pool) is a Best Practice ™
Happy Hacking!
I've been playing with NaCl — specifically the TweetNaCl implementation — in an effort to implement reliable end-to-end encryption in a project of mine, via CurveCP.
NaCl brings powerful and safe security primitives to the table, in the form of a cryptographic box. We are accustomed to dealing with cryptography at a ludicrously low level (a hash function here, a MAC algorithm there). crypto_box, the heart and soul of NaCl, wraps up best practices, namely Curve25519 key exchange, XSalsa20 encryption, and Poly1305 message authentication.
From one of the research papers on NaCl:
NaCl provides a high-level function crypto_box that does everything in
one step, converting a packet into a boxed packet that is protected against
espionage and sabotage. Programmers can use lower-level functions but are
encouraged to use crypto_box.
Usage seems straightforward.
First, we have to generate a Curve25519 public / private keypair:
uint8_t public[32];
uint8_t secret[32];
int rc = crypto_box_keypair(public, secret);
assert(rc == 0);
The secret key will be randomly generated, and then a Curve25519 public key counterpart will be derived.
Easy.
Next up, we use crypto_box() itself to encrypt our plaintext into a ciphertext buffer. The call signature seems pretty straightforward at first glance:
int
crypto_box(uint8_t *ciphertext,
uint8_t *plaintext, size_t plaintext_len,
uint8_t nonce[24],
uint8_t public[32],
uint8_t secret[32]);
The first argument is a buffer to house the encrypted ciphertext.
The second argument is the buffer to read the plaintext from. It should be plaintext_len octets long.
The fourth argument is a nonce (number used once), which exists to perturb the ciphertext and allow use of the same keypair without leaking key material. So long as it is unique (a counter will do), we should be good.
The fifth and sixth arguments are the recipient's public key and the sender's private key, respectively. Since crypto_boxes are part of a communication system, they are intended to be assembled by the sender and only readable by the correct recipient.
So, let's give this a shot.
#define MESSAGE "There are strange things done in the midnight sun\n" \
                "By the men who toil for gold;\n" \
                "The Arctic trails have their secret tales\n" \
                "That would make your blood run cold;\n" \
                "The Northern Lights have seen queer sights,\n" \
                "But the queerest they ever did see\n" \
                "Was that night on the marge of Lake Lebarge\n" \
                "I cremated Sam McGee.\n"
/* - Robert W. Service */
#define MESSAGE_LEN 304 /* I counted */
int rc;
uint8_t client_pub[32], client_sec[32];
uint8_t server_pub[32], server_sec[32];
uint8_t cipher[512], plain[512];  /* big enough for MESSAGE_LEN plus padding */
uint8_t nonce[24];
/* "generate" nonce */
memset(nonce, 0, 24);
/* generate keys */
rc = crypto_box_keypair(client_pub, client_sec); assert(rc == 0);
rc = crypto_box_keypair(server_pub, server_sec); assert(rc == 0);
memcpy(plain, MESSAGE, MESSAGE_LEN);
dump("plaintext, before encryption", plain, MESSAGE_LEN);
/* encipher message from client to server */
rc = crypto_box(cipher, plain, MESSAGE_LEN,
nonce, server_pub, client_sec);
dump("ciphertext", cipher, MESSAGE_LEN);
assert(rc == 0);
And here's the output (full, compilable code is over here on github, as encrypt.c):
plaintext, before encryption
------------------------------------------------
54 68 65 72 65 20 61 72 65 20 73 74 72 61 6e 67
65 20 74 68 69 6e 67 73 20 64 6f 6e 65 20 69 6e
20 74 68 65 20 6d 69 64 6e 69 67 68 74 20 73 75
6e 0a 42 79 20 74 68 65 20 6d 65 6e 20 77 68 6f
20 74 6f 69 6c 20 66 6f 72 20 67 6f 6c 64 3b 0a
54 68 65 20 41 72 63 74 69 63 20 74 72 61 69 6c
73 20 68 61 76 65 20 74 68 65 69 72 20 73 65 63
72 65 74 20 74 61 6c 65 73 0a 54 68 61 74 20 77
6f 75 6c 64 20 6d 61 6b 65 20 79 6f 75 72 20 62
6c 6f 6f 64 20 72 75 6e 20 63 6f 6c 64 3b 0a 54
68 65 20 4e 6f 72 74 68 65 72 6e 20 4c 69 67 68
74 73 20 68 61 76 65 20 73 65 65 6e 20 71 75 65
65 72 20 73 69 67 68 74 73 2c 0a 42 75 74 20 74
68 65 20 71 75 65 65 72 65 73 74 20 74 68 65 79
20 65 76 65 72 20 64 69 64 20 73 65 65 0a 57 61
73 20 74 68 61 74 20 6e 69 67 68 74 20 6f 6e 20
74 68 65 20 6d 61 72 67 65 20 6f 66 20 4c 61 6b
65 20 4c 65 62 61 72 67 65 0a 49 20 63 72 65 6d
61 74 65 64 20 53 61 6d 20 4d 63 47 65 65 2e 0a
------------------------------------------------
ciphertext
------------------------------------------------
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
20 08 4b 11 63 5c 5c 39 6c 6d a0 d0 0c 4c 11 e4
36 db 04 89 e9 8d e5 44 e0 3f 51 65 30 1b 90 a5
5c 8b 04 27 00 d3 f2 13 b2 4d 64 96 3e 6f a3 d9
ba 7b 1e aa e9 1f ac ef b8 17 63 4b 40 95 77 d7
39 fa 8b 5a 36 f3 18 d8 cc 12 54 af 45 e0 6a 78
48 3f bb 0e c3 85 12 3b 66 d0 18 1c ff 1d a5 29
2e c8 8f e7 67 08 f8 94 16 bb 1c b8 1d 14 b3 9b
64 2b 4e ec 53 5a 23 22 bf b1 3c 40 f0 5d dc 97
66 26 a5 ad 9d 2f 2c d6 8a 4f c1 57 a1 5a 0c 6f
c1 e1 69 cd 76 f3 00 ee 40 5e d7 eb 38 05 f2 e4
e7 68 72 1b af f6 ae 2d 1e 5d b4 53 dd 10 37 6a
eb c7 dc bc c7 3e b2 55 33 ad fa 03 9b bc 90 66
ed f6 4c 0c a5 c5 a7 d3 05 a1 10 c9 af 53 86 6c
d3 30 3d d5 b4 4a 25 45 16 6f ab 9e ec cb 89 39
b8 57 65 6b e8 87 cd 05 21 96 32 30 53 92 97 6d
e7 32 9e 02 f4 fd dc 5e f5 14 90 d3 84 d0 98 01
c4 7c c5 a4 64 46 74 37 ea 85 0e 65 85 8a e3 f8
2c 1c 77 ce 69 04 cd 80 ab 60 cf e8 e3 c7 35 d6
------------------------------------------------
We have ciphertext! That first row of all zeros seems odd...
Like backups without a restore operation, encrypting data is only half of the story. We have to be able to open that cryptographic box as the receiver, or we won't really have much of a communication system...
Here are the salient bits of decrypt.c:
/* erase all trace of plaintext */
memset(plain, 0, 512);
/* decipher message as server, using client's public key */
rc = crypto_box_open(plain, cipher, MESSAGE_LEN,
nonce, client_pub, server_sec);
dump("plaintext, after decryption", plain, MESSAGE_LEN);
assert(rc == 0);
And here's what happens when we run it:
plaintext, after decryption
------------------------------------------------
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
------------------------------------------------
Assertion failed: (rc == 0), function main, file decrypt.c, line 88.
Abort trap: 6
You didn't really think it would be that easy, did you?
As it turns out, there is a padding requirement briefly mentioned in the docs:
WARNING: Messages in the C NaCl API are 0-padded versions of messages in
the C++ NaCl API. Specifically: The caller must ensure, before calling the
C NaClcryptobox
function, that the firstcryptobox_ZEROBYTES
bytes
of the message m are all 0. Typical higher-level applications will work
with the remaining bytes of the message; note, however, that mlen counts
all of the bytes, including the bytes required to be 0.
crypto_box_ZEROBYTES turns out to be 32.
If we adjust the call to crypto_box, we should get past the failing assertion and get some (hopefully correct) deciphered plaintext!
Here's our second attempt, decrypt2.c:
memset(plain, 0, 32);                   /* first 32 octets are ZERO */
memcpy(plain+32, MESSAGE, MESSAGE_LEN); /* then comes the real data */
dump("plaintext, before encryption", plain, MESSAGE_LEN+32);
/* encipher message from client to server */
rc = crypto_box(cipher, plain, MESSAGE_LEN+32,
nonce, server_pub, client_sec);
dump("ciphertext", cipher, MESSAGE_LEN);
assert(rc == 0);
/* erase all trace of plaintext */
memset(plain, 0, 512);
/* decipher message as server, using client's public key */
rc = crypto_box_open(plain, cipher, MESSAGE_LEN+32,
nonce, client_pub, server_sec);
assert(rc == 0);
dump("plaintext, after decryption", plain, MESSAGE_LEN+32);
assert(memcmp(MESSAGE, plain, MESSAGE_LEN) == 0);
plain[MESSAGE_LEN+1] = '\0';
printf("%s\n", plain);
And here's the output (I'm going to start snipping the output, for brevity's sake):
plaintext, before encryption
------------------------------------------------
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
54 68 65 72 65 20 61 72 65 20 73 74 72 61 6e 67
65 20 74 68 69 6e 67 73 20 64 6f 6e 65 20 69 6e
.................. snip .....................
65 20 4c 65 62 61 72 67 65 0a 49 20 63 72 65 6d
61 74 65 64 20 53 61 6d 20 4d 63 47 65 65 2e 0a
------------------------------------------------
ciphertext
------------------------------------------------
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
cc ec 2a d9 12 73 24 ed 9f 98 40 37 e4 f7 1e 84
42 c7 09 9e ac c0 ed 52 eb 76 45 79 36 5a 8d b7
.................. snip .....................
d2 7c fd a9 67 53 26 3e e6 e8 2f 31 c6 97 e8 b5
39 00 77 8a 24 36 de 8a ee 0d c3 c9 a6 ee 7a b7
------------------------------------------------
plaintext, after decryption
------------------------------------------------
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
54 68 65 72 65 20 61 72 65 20 73 74 72 61 6e 67
65 20 74 68 69 6e 67 73 20 64 6f 6e 65 20 69 6e
.................. snip .....................
65 20 4c 65 62 61 72 67 65 0a 49 20 63 72 65 6d
61 74 65 64 20 53 61 6d 20 4d 63 47 65 65 2e 0a
------------------------------------------------
Assertion failed: (memcmp(MESSAGE, plain, MESSAGE_LEN) == 0), function main,
file decrypt2.c, line 49.
Abort trap: 6
If you look closely, you'll see that the plaintext lines up, but the plain buffer doesn't actually match our input message (per the memcmp assertion failure). That's because those 32 bytes of padding are still there!
Let's remove that. Here's our third (and hopefully final) attempt, decrypt3.c:
/* decipher message as server, using client's public key */
rc = crypto_box_open(plain, cipher, MESSAGE_LEN+32,
nonce, client_pub, server_sec);
assert(rc == 0);
memmove(plain, plain+32, MESSAGE_LEN);
dump("plaintext, after decryption", plain, MESSAGE_LEN);
assert(memcmp(MESSAGE, plain, MESSAGE_LEN) == 0);
plain[MESSAGE_LEN] = '\0';
printf("\n%s\n", plain);
The memmove call overwrites the 32 zeros from crypto_box_open(). Let's see where that leaves us:
plaintext, before encryption
------------------------------------------------
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
54 68 65 72 65 20 61 72 65 20 73 74 72 61 6e 67
65 20 74 68 69 6e 67 73 20 64 6f 6e 65 20 69 6e
.................. snip .....................
65 20 4c 65 62 61 72 67 65 0a 49 20 63 72 65 6d
61 74 65 64 20 53 61 6d 20 4d 63 47 65 65 2e 0a
------------------------------------------------
ciphertext
------------------------------------------------
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
cc ec 2a d9 12 73 24 ed 9f 98 40 37 e4 f7 1e 84
42 c7 09 9e ac c0 ed 52 eb 76 45 79 36 5a 8d b7
.................. snip .....................
9d 29 4f 25 1e 6a 0a 99 83 7c e2 2e 94 24 7d b5
3d f7 4f 96 77 3d d9 3c 65 2a 95 52 2a 07 8d 1c
------------------------------------------------
plaintext, after decryption
------------------------------------------------
54 68 65 72 65 20 61 72 65 20 73 74 72 61 6e 67
65 20 74 68 69 6e 67 73 20 64 6f 6e 65 20 69 6e
.................. snip .....................
65 20 4c 65 62 61 72 67 65 0a 49 20 63 72 65 6d
61 74 65 64 20 53 61 6d 20 4d 63 47 65 65 2e 0a
------------------------------------------------
There are strange things done in the midnight sun
By the men who toil for gold;
The Arctic trails have their secret tales
That would make your blood run cold;
The Northern Lights have seen queer sights,
But the queerest they ever did see
Was that night on the marge of Lake Lebarge
I cremated Sam McGee.
Success!
There is a 32-octet padding requirement on the plaintext buffer that you pass to crypto_box. Internally, the NaCl implementation uses this space to avoid having to allocate memory or use static memory that might involve a cache hit (see Bernstein's paper on cache timing side-channel attacks for the juicy details).
Similarly, the crypto_box_open call requires 16 octets of zero padding before the start of the actual ciphertext, which is used in a similar fashion. These padding octets are not part of either the plaintext or the ciphertext, so if you are sending ciphertext across the network, don't forget to remove them!
Not that such a thing would keep someone up until 2:14am...
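Were I to wrap this up for real network use, I'd bury the padding bookkeeping in a helper. Here's a sketch against the TweetNaCl API used above (boxit and its contract are my own invention, not NaCl's):

#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include "tweetnacl.h"

/* Box mlen octets of msg into out, which must hold mlen + 16 octets of
 * wire-ready ciphertext (the 16 leading zero octets are stripped).
 * Returns 0 on success. */
int boxit(uint8_t *out, const uint8_t *msg, size_t mlen,
          const uint8_t nonce[24],
          const uint8_t pub[32], const uint8_t sec[32])
{
    int rc;
    size_t len = mlen + crypto_box_ZEROBYTES;   /* mlen + 32 */
    uint8_t *m = calloc(len, 1);                /* zeroed: padding comes free */
    uint8_t *c = calloc(len, 1);

    if (!m || !c) { free(m); free(c); return -1; }
    memcpy(m + crypto_box_ZEROBYTES, msg, mlen);

    rc = crypto_box(c, m, len, nonce, pub, sec);
    if (rc == 0)   /* drop the leading BOXZEROBYTES zeros before sending */
        memcpy(out, c + crypto_box_BOXZEROBYTES, len - crypto_box_BOXZEROBYTES);

    free(m); free(c);
    return rc;
}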
Happy Hacking!
if (rc < 0) {
/* FIXME: this is wrong */
return -1;
}
People who read my code are often amused (or perhaps horrified) to find little comments like that one, strewn about haphazardly. They litter the code, especially in nascent projects; tiny landmarks calling attention to the fact that James didn't finish something all the way.
I like my little FIXMEs, and I'll tell you why - they free me from the mundane details, so that I can focus on the bigger picture.
Don't mistake me — I love the details. They are the texture of any good bit of technical work. Details are what separate the professionals from the amateurs, the craftspeople from the hacks. Details are the most important part of a production-grade project.
But most of my projects aren't production-grade, and absolutely none of the successful ones start out life that way. Over the past two years, I've taken a toy-first approach to building software, and it's working out better than I had ever hoped.
You see, by assuming every project I start is, first and foremost, just a toy, my programmer's guilt has less power over me. I stop caring about nice debugging, error handling, syslog integration etc. — all things explicitly not related to the project's primary raison d'être. Most of the time, the project scratches the itch it was meant to, I push some code up to Github, and move on with life.
But sometimes, a project takes on a life of its own. Someone starts using it regularly. That someone might even be me. And there, in the codebase, are a bunch of little road signs that remind what I sacrificed to get there.
FIXME: this is wrong, but it feels so right.
I spend a lot of time on a computer, and I have thoroughly customized my environment to be just the way I like it. It's over here on Github, and I use it everywhere.
In late 2015, bash 4.4 or so picked up the ability to display a PS1-style prompt immediately before a command starts. They called the environment variable that houses this prompt definition PS0. This broke gitprompt something fierce.
Today, I installed a 4.4.x bash on one of my FreeBSD boxen, and got bit.
I patched gitprompt on my local env, but since I don't know where it canonically lives, I couldn't patch it "upstream" — c'est la vie!
tl;dr: If you want to upgrade to Bash 4.4.x, and you use gitprompt, you really want to apply this patch; and if you are setting $PS0 for a custom gitprompt, change that over to PSGIT.
I wrote two little tools today, rlog and errno.
errno - Looking up Error Numbers
The fact that this utility didn't really exist had always bothered me, so I wrote it myself after grepping through /usr/include looking for ESOMETHING constants and their numeric definitions.
It works like a little something like this:
$ errno 2
ENOENT 2 No such file or directory
$ errno EAFNOSUPPORT EAGAIN EWOULDBLOCK EINVAL
EAFNOSUPPORT 47 Address family not supported by protocol family
EAGAIN 35 Resource temporarily unavailable
EAGAIN 35 Resource temporarily unavailable
EINVAL 22 Invalid argument
$ errno | grep Bad
EBADF 9 Bad file descriptor
EFAULT 14 Bad address
EPROCUNAVAIL 76 Bad procedure for program
EBADEXEC 85 Bad executable (or shared library)
EBADARCH 86 Bad CPU type in executable
EBADMSG 94 Bad message
This can be quite handy when all you get from a program is the error message (e.g. 'Invalid argument') and you're trying to track down which system call is failing. Most kernels ship with pretty good syscall man pages, and they tend to list all of the possible values of errno that can result, and why.
For example:
ERRORS
The socket() system call fails if:
[EACCES] Permission to create a socket of the specified
type and/or protocol is denied.
[EAFNOSUPPORT] The specified address family is not supported.
[EMFILE] The per-process descriptor table is full.
[ENFILE] The system file table is full.
[ENOBUFS] Insufficient buffer space is available.
The socket cannot be created until sufficient
resources are freed.
[ENOMEM] Insufficient memory was available to fulfill
the request.
[EPROTONOSUPPORT] The protocol type or the specified protocol is
not supported within this domain.
[EPROTOTYPE] The socket type is not supported by the protocol.
If a new protocol family is defined, the socreate process is free
to return any desired error code. The socket() system call will
pass this error code along (even if it is undefined).
Now, if you get ENOMEM, and print a message via strerror(3), you'll get something like this in your output stream:
Cannot allocate memory
With errno, you can reverse that process and get back the ENOMEM constant, with ease:
$ errno | grep 'Cannot allocate memory'
ENOMEM 12 Cannot allocate memory
Hey, look at that! It's ENOMEM. Incidentally, on my Macbook, that maps to the numeric error code 12. Neat.
The errno utility also lets you "preview" the error message a given code will produce, which can be helpful in supporting lots of different platforms.
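The heart of that preview is nothing more than a walk across the error number space, asking strerror(3) for each message. A minimal sketch (the upper bound is arbitrary, and the symbolic ENOMEM-style names still have to come from the header-grepping step):
#include <stdio.h>
#include <string.h>

int main(void)
{
    int i;

    /* 133 is an arbitrary cap; real errno values are small integers */
    for (i = 1; i < 133; i++)
        printf("%3d  %s\n", i, strerror(i));
    return 0;
}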
The code for errno is over here, on Github.
rlog - Ring Buffer Logging
A ring buffer, or circular buffer, is a constant-space data structure that prioritizes newer things over older things. rlog is a Ring Buffer Logging utility that brings an innovative twist to the collection, retrieval, and long-term retention of debug logging.
Ah, debugging, that glorious panacea of the operations world.
If only I had debugging mode turned on, I bet I could get to the bottom of
this weird problem! But turning it on means restarting the daemon, and I
may not be able to reproduce the problem once that happens!
Oh what's a cloud engineer to do!?
One solution, popular with people who like to pay for lots and lots of storage and then never read the logs, is to just log everything, all the time. They set up centralized syslog, Hadoop, Elasticsearch, Cassandra, or even -gasp- MongoDB. After a few weeks, there is so much data to look through, it's pointless to even try!
Rather than do that, I wrote rlog.
rlog reads every line from standard input, and datestamps each. It allocates enough space to retain 2048 messages. When the 2049th message is received, the 1st message is dropped. The 2050th message pushes out the 2nd message, and so on.
When a client connects to rlog (usually via 127.0.0.1:1040), they receive the most recent 2048 messages immediately (hence the datestamps), and then receive future messages as they are received by rlog itself.
It's like tail -f, except (a) there is no file and (b) it takes up no disk space.
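The underlying data structure is simple enough to sketch in a few lines of C. These names are mine, not rlog's; only the 2048-message capacity matches the description above:
#include <stdlib.h>

#define RING_SIZE 2048

struct ring {
    char   *lines[RING_SIZE];
    size_t  head;    /* index of the oldest message */
    size_t  count;   /* how many messages we currently hold */
};

void ring_push(struct ring *r, char *line)
{
    if (r->count < RING_SIZE) {
        /* still filling up: append after the newest message */
        r->lines[(r->head + r->count++) % RING_SIZE] = line;
    } else {
        /* full: the new message evicts the oldest one */
        free(r->lines[r->head]);
        r->lines[r->head] = line;
        r->head = (r->head + 1) % RING_SIZE;
    }
}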
I'll be using rlog to provide high-resolution debugging output from live production systems. When I need the diagnostics, I hook up a nc process to the port. When I don't, it's chewing a (small) constant amount of RAM. Win-win!
rlog is also over at Github.
Small tools like these can be sharp. They don't take a lot of time to build. The low time investment makes them more malleable: easier to fix, improve, and expand.
Happy Hacking!
In the world of Software and System Implementation, there are creators, and then there are critics.
Creators see a problem and they go build something to solve it.
Critics see that solution and poke holes in it.
Creators see a broken thing and they work up a patch to fix it.
Critics see the fix as klugey, an inexcusable hack.
Creators build the tools they need now.
Critics won't accept working, today; they want perfect, some day.
I recently showed off a preliminary version of a small side-project of mine. My hope was to get the audience excited about the prospects and maybe provide some useful feedback on featureset and direction. The core features were all there, but some aspects of the implementation were definitely lacking. The project was four days old.
One person in the audience kept laughing about the name, adding syllables until it sounded like a certain deadly viral outbreak. Someone else wanted to know why I hadn't implemented features X, Y, and Z, and had I looked at <insert random library here>, because "they've already done this."
That's criticism, not creation.
It's easy to criticize. Go build something.
Bloom filters let you be lazy, and that makes them awesome.
Fundamentally, a Bloom filter is a probabilistic data structure that can answer the question "is X in set S?" with one of two answers:
1. No, definitely not.
2. Maybe.
If that doesn't sound terribly useful, let's go build a NoSQL database storage engine. It'll be fun, and it won't take that long at all!
Let's start with some domain models.
An object is a collection of attributes, with values.
Our access pattern analysis (yes, we did one) indicates that most objects are created, referenced immediately and then accessed infrequently as they age. To maximize our I/O performance, we will store sets of objects in separate blocks, each of which lives in a file on-disk, and gets memory-mapped in when needed.
A naïve data architecture might look something like this:
(We're going to assume that the blocks themselves are structured sanely, but we'll skip the particulars.)
Finding an object is straightforward. Start at the first block, searching each sequentially until you find an object that has the named attributes, with the requisite values. When you reach the end of one block, continue to the next. If you hit the NULL at the end, the object doesn't exist.
Functional, yes. Performant, hell no.
The problem stems from the non-uniform characteristic of each block — they can contain all kinds of objects. These objects may be so dissimilar from our search criteria that it's no use even scanning the block. If only we had a compact way of representing the attributes present in each block's object set.
Ooh! Bloom filters!
If we stuff a moderately sized Bloom filter into our block structure (probably after next, but before you get to the data itself), we can do a quick gut check to see if we should even bother scanning. Hey Bloom filter, does the block even have the attributes we're searching on?
The no/maybe nature of the answers you get from a Bloom filter doesn't hurt us too much here. For starters, false negatives, where the filter mistakenly says that an attribute is definitely not present, are prohibited by the underpinning math. That means no matter what, we'll find all of the relevant blocks. The ambiguity of the it might be here answer is also not a problem. If the filter mistakenly infers that the attribute is present when in fact it isn't, we end up scanning the block needlessly, but it doesn't affect the correctness of the search algorithm. These mistakes, by the way, are called false positives, and we can control their chance of occurrence by tuning the Bloom filter.
Bloom filters were proposed by Burton Bloom, in his 1970 paper, Space/Time Trade-offs in Hash Coding With Allowable Errors. In it, he proposes the use of imperfect hash-coding as an acceptable substitute where perfect hashing would require too much memory.
As an example use case, Bloom turns to the English language. 90% of words in English can be hyphenated correctly using a few simple (low cost) rules. The other 10% require more costly (computationally speaking) measures. So Bloom proposed a solution whereby it would be simple to determine if a single element of a set was either (a) definitely not in the 10% or (b) possibly in the 10%, using probability and some clever math.
The basic idea is this: using a couple of different hash functions, calculate a bunch of different hashed values. Use those hashed values as indices into an array of bits, and set those positions to 1. Then, to test for subset membership, repeat the hashing operations, and see if the bits at all indices are non-zero.
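In code, the whole trick fits in a few lines. Here's a toy sketch (a 64-bit filter with k = 3; the seeded hash function is a stand-in I made up for illustration, not something you'd ship):
#include <stdbool.h>
#include <stdint.h>

#define M 64   /* bits in the filter */
#define K 3    /* number of hash functions */

static uint64_t bits;

/* a seeded multiplicative hash; real filters use FNV, murmur,
   etc. with K different seeds */
static unsigned h(const char *key, unsigned seed)
{
    unsigned v = seed;
    while (*key)
        v = v * 31 + (unsigned char)*key++;
    return v % M;
}

void bloom_add(const char *key)
{
    unsigned i;
    for (i = 0; i < K; i++)
        bits |= UINT64_C(1) << h(key, i + 1);
}

bool bloom_check(const char *key)
{
    unsigned i;
    for (i = 0; i < K; i++)
        if (!(bits & (UINT64_C(1) << h(key, i + 1))))
            return false;   /* definitely not in the set */
    return true;            /* ...might be in the set */
}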
Let's illustrate, graphically.
We start with a 16-element bit vector. Each box above is a single slot in the vector, and they are all empty.
Here, we insert key-the-first into the filter. We do so by calculating three hashed values, using three different hash functions, \(h_1\), \(h_2\) and \(h_3\). We get the values 2, 11, and 14, and we set the bits at those three positions in our bit vector.
Next, we insert key-the-second, using the same process. Calculate the three hashed values (5, 11, and 15), and set the bit vector positions appropriately. Note that \(h_2\) collided on this particular key — it was bound to happen eventually, and it's actually why we triangulate with more than one hashing function.
Checking keys is similar:
To see if key-the-first is in the filter set, we again calculate the hashed values using our hashing functions. However, instead of modifying the bit vector, we're just going to query it. Since all three bits (2, 11, and 14) are set, we can say that key-the-first might be in the filter set.
Conversely, looking for a key that we have not yet added to the filter set:
fails. We calculate the hashed values to be 0, 2, and 14, but since not all of those bit vector slots have been set, we can guarantee that key-the-third is definitely not in the filter set.
Why can't a Bloom filter give a definitive positive answer? Because there just isn't enough information. Intuitively, take note that the size of the Bloom filter bit vector is fixed, but the key space is infinite — eventually our hash functions will collide so much that we start getting false positives as residuals from prior insertions. All it takes is three collisions, across the three hash functions we are using, to accidentally set all three bit vector positions to 1.
We can actually calculate the probability of getting a false positive, as a function of the number of hash functions, \(k\), the size of the bit vector, \(m\), and the number of items expected to be inserted into the filter set, \(n\):
$$f = (1 - e^{-kn/m})^k$$
(For more rigorous maths, check the literature)
Armed with the false positive probability function, we can not only calculate the false positive rate for any Bloom filter configuration, but we can also calculate the ideal value for \(k\), given the ratio \(m/n\).
Broder and Mitzenmacher do just that by taking the derivative of \(f\), and find:
$$k_{ideal} = \frac{m}{n} \ln{2}$$
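To put some numbers on that: at 10 bits of filter per expected element (\(m/n = 10\)), the ideal \(k\) works out to about 7 hash functions, which drives the false positive rate down to roughly 0.8%.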
Bloom filter implementations like this one can use that equation to determine the correct \(k\) given a bloom factor represented as an integer floor/ceiling of \(m/n\).
So, if we modify the data architecture of our NoSQL storage engine to look like this:
We can modify our search algorithm to work thusly:
1. Check the block's bloom filter for the presence of the attributes we're searching on.
2. If the filter says "maybe", scan the block for matching objects.
3. Either way, follow next to the next block and repeat.
Terminating, of course, when next is NULL. (The whole loop is sketched in C below.)
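Everything in this sketch is hypothetical scaffolding; struct block, struct object, block_bloom_check() and scan_block() are names I made up to show the shape of the algorithm:
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct object;                       /* opaque, for this sketch */

struct block {
    struct block *next;              /* singly-linked chain of blocks */
    uint64_t      filter;            /* this block's Bloom filter bits */
    /* ... the object data itself ... */
};

/* hypothetical helpers */
bool block_bloom_check(struct block *b, const char *attr);
struct object *scan_block(struct block *b, const char *attr, const char *value);

struct object *find(struct block *b, const char *attr, const char *value)
{
    for (; b != NULL; b = b->next) {
        if (!block_bloom_check(b, attr))
            continue;                /* "definitely not": skip the scan */

        /* "maybe": pay for the full scan.  a false positive just
           means this scan comes up empty-handed. */
        struct object *o = scan_block(b, attr, value);
        if (o != NULL)
            return o;
    }
    return NULL;                     /* hit the NULL at the end: no such object */
}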
We incur a bit more overhead on the write side, since our insertion process for new objects becomes:
1. Write the object into the block.
2. Add each of the object's attributes to the block's bloom filter.
This can be batched using something like an LSM-tree to bolster insertion performance, at the cost of either durability or the introduction of a write-ahead log. We can even pull the Bloom filter out of each block and put it in special indexing blocks that can remain memory-mapped, further minimizing disk access. And it's all possible thanks to Bloom filters!
If you can't get enough of research papers, this section is for you!
I love the concept of assertions, little bits of code that exist to make sure that all the other bits of code are playing by the rules. Contrast that with intentions, which aren't in code at all, often existing only in the documentation or, worse, in the original programmer's head.
Consider this function to count the length of a '\0'-terminated string:
int _strlen(const char *s)
{
int n = 0;
while (*s++)
n++;
return n;
}
There's an intention here; the caller really shouldn't pass NULL for the s parameter. The first time the code tests the conditional in the while loop, the program is going to segfault.
Let's turn that into an assertion:
#include <assert.h>
int _strlen(const char *s)
{
assert(s != NULL);
int n = 0;
while (*s++)
n++;
return n;
}
Now we have an assertion, and the program will check itself to make sure that some other part of the program didn't accidentally try to calculate the length of NULL.
Of course, the only respectable course of action to pursue when an assertion is violated is to abort the program, but at least you get an error message to the effect of "assertion failed".
In fact, let's try it out by deliberately sabotaging ourselves:
#include <assert.h>
#include <stdio.h>
int _strlen(const char *s)
{
assert(s != NULL);
int n = 0;
while (*s++)
n++;
return n;
}
int main(int argc, char **argv)
{
_strlen(NULL); /* should fail */
return 0;
}
And when we run it?
→ ./traditional
Assertion failed: (s != NULL), function _strlen, file traditional.c, line 6.
Abort trap: 6
That's exactly what we want to see. If we run a sufficient battery of tests against our function, we should see the assertion bomb out, and our tests fail.
But assert() has some problems of its own, which we should talk about.
assert()
I see two main issues: reliability and messaging.
1. It can be disabled at build time.
You can't rely on your assert()-based assertions to actually fire. Whoever is building your software could just set CPPFLAGS=-DNDEBUG and disable them altogether.
From the assert(3) man page:
The assert() macro may be removed at compile time with the
cc(1) option -DNDEBUG.
2. Messages printed on assertion failure are pretty basic.
Sure, you get the function, source file and line number in the error message, along with the test itself (s != NULL), but you can't add an explanation for the human operators who will invariably see this message in the logs some day.
Even having the function / file / line number is of dubious value, because code changes over time. Functions get renamed, files split or merged. The assertion on line 432 might be on line 467 in v1.0.1, replaced in v1.1.9, and removed outright in v2.0.0. All of these factors conspire to make it difficult to trace down problems even in F/OSS software where you have unfettered access to the source code!
insist()
I wrote a replacement assertion macro that I call insist():
#include "insist.h"
int _strlen(const char *s)
{
insist(s != NULL, "_strlen(NULL) is undefined");
int n = 0;
while (*s++)
n++;
return n;
}
Disabling it is harder (it's still possible, but it won't be done by accident by a well-meaning package maintainer), and you get to specify a message that prints when the assertion fails.
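Here's a sketch of the general shape such a macro can take (my approximation, not necessarily the actual insist.h): a message for humans, and no -DNDEBUG escape hatch.
#include <stdio.h>
#include <stdlib.h>

#define insist(cond, msg) do {                                  \
        if (!(cond)) {                                          \
            fprintf(stderr,                                     \
                    "insist violation: %s\n  (%s), %s:%d\n",    \
                    (msg), #cond, __FILE__, __LINE__);          \
            abort();                                            \
        }                                                       \
    } while (0)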
Hopefully you find it useful. You can get the code here. It's licensed MIT, so you can embed it in your project with almost no restrictions. It's also really small (<100 lines, including the copyright notice!)
Happy Hacking!
Modern CPU architectures reward parallel / concurrent programs with higher throughput. Gone are the days of writing single-threaded, serially executed code—it just doesn't scale. As chip vendors pack more and more logical cores onto the silicon, this performance gap widens.
Threading to the rescue! Parallelize your solution, spin up a bunch of concurrent threads, and go. If only it were that simple. Too bad there's data.
Data ruins everything. TV. The word “big”. Dating. Political debates. Naïve parallelization of innately serial algorithms.
The problem with data is that we have to use it, often from multiple concurrent threads, without introducing TOCTOU (Time of Check, Time of Use) problems or other race conditions.
So our program grows some mutex (mut-ual ex-clusion) locks. Whenever a thread wants to read or write to a shared bit of data, it must acquire the lock. Since only one thread can hold the lock at any given time, we're golden. The program is correct and everyone is happy.
Except that the performance suffers.
This is a classic trade-off in computer science. Go fast / be safe. Pick one. Or, to put it in Firefly terms: you can increase speed if you run without core containment, but that's tantamount to suicide. Boo-yah. Firefly reference.
Back to our performance problem. The root cause of observed slowdown is the bottleneck caused by the mutex lock. By definition, the mutex lock serializes parts of the program, which introduces bottlenecks. Readers can't read while a writer is writing. But readers can totally read while other readers are reading. The next evolution of the program introduces reader/writer locks.
Reader/writer locks split the locking activity into two parts. A write lock works like our previous lock - only one thread can hold the write lock at any single instant. The read lock, on the other hand, (a) can be held as long as no one has the write lock, (b) can be held by multiple readers concurrently and (c) precludes any thread from obtaining the write lock.
The upshot of this new approach is that reader threads are not held back by other readers (a situation called starvation), but we still don't introduce any data race conditions because all reads will be serialized with any writes.
For read-heavy workloads, the optimization usually ends at the reader-writer lock step. Improving the throughput of writes rarely improves performance since the bulk of the work revolves around reads. For other workloads, including write-heavy and split read/write, optimizing writes is essential.
Enter Read-Copy Update, or RCU. A constrained form of multiversion concurrency control (MVCC for those TLA fans out there), RCU solves the scaling problem by trading convergence for availability (remember the CAP theorem?). The premise is simple: as long as readers get a consistent view of the shared data, does it really matter that they get the most up-to-date version?
Consider a linked list that looks like this:
Under a locking strategy, inserting a new item would wait until there were no readers before going ahead with modifications. What if, instead, we could ensure that a reader got one list or the other, but not some weird in-between version? That is, a reader would see either:
or:
The kicker is that either scenario is perfectly valid.
Without getting into the nitty-gritty implementation details (that's a different post altogether), this is what RCU nets you: the ability to do updates with minimal serialization between readers.
In grossly oversimplified terms, RCU performs atomic modifications on a shared data structure such that any reader can traverse the data structure at any time, without getting a corrupted view. Let's return to our A → C → D list, with two reader threads traversing the list at different points.
Without synchronizing with these two readers, an updater thread can create a new list item, B, and half-splice it into the list by linking its next pointer to C:
Nothing has changed for either reader. The first reader is still set to traverse A → C → D, and the second reader will finish traversing C → D (having already seen A).
The next step (which is also atomic) replaces the next pointer of A with a pointer to B, thereby completing the insert operation:
Now we've affected the readers. If the first reader is scheduled after the atomic next-swap, it will traverse A → B → C → D. If it gets scheduled before the swap, it will see A → C → D. No matter what, the second reader is not affected by the operation, and will see the entire list as A → C → D.
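In C11 atomics, that two-step insert might look like this sketch (the names are mine; this is not the liburcu API):
#include <stdatomic.h>

struct node {
    int value;
    struct node *_Atomic next;
};

/* splice new_node in between prev and whatever follows it.
   concurrent readers see either the old chain or the new one,
   never a half-built list. */
void rcu_list_insert(struct node *prev, struct node *new_node)
{
    /* step 1: point B at C.  nobody can see B yet, so this
       needs no ordering at all */
    atomic_store_explicit(&new_node->next,
                          atomic_load_explicit(&prev->next,
                                               memory_order_relaxed),
                          memory_order_relaxed);

    /* step 2: publish B by atomically swinging A's next pointer;
       the release barrier ensures B is fully initialized before
       it becomes visible */
    atomic_store_explicit(&prev->next, new_node, memory_order_release);
}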
Removal is similar, except that the operations play out in reverse, and there's a small housekeeping task called reclamation or synchronization.
At this point in our example, we've completed our insertion operation, so the full list is A → B → C → D. The second reader has completed its traversal of the list (no orange arrow). The first reader has advanced to B (and becomes our new second arrow). We also now have a new reader starting at the head of the list (in green).
Let's remove B! The first thing we are going to do is re-link A directly to C by (atomically) swapping its next pointer appropriately:
Now we can see the same “dual-state” phenomenon we saw before with insertion; the first reader will see the post-remove version of the list (A → C → D), while the second reader finishes out B → C → D having already seen A.
Assume that new readers will have to start at the head of the list (A) and proceed linearly—that is, no random access, and no pointer aliasing is allowed. If we can figure out when all existing readers have lost access to the deleted item B (we can), then we can free the list item and any associated memory.
Depending on how you implement this, the writer thread can synchronize on the RCU-protected data structure, waiting for all readers to lose visibility into the data structure's interior, or the writer thread can defer that task to a reclamation thread that periodically synchronizes.
If you read through the last few paragraphs but couldn't help thinking “gee, this sounds a lot like garbage collection semantics,” you would be spot on.
Garbage collection is fundamentally about letting other parts of the system (namely, programmers) forget to clean up after themselves, by explicitly freeing resources that can no longer be reached.
RCU executes this reachability analysis through the use of read-side critical sections, quiescent states and grace periods.
A read-side critical section is a window in both time and code during which a reader may retain access to internals of the shared data. An RCU-aware list traversal algorithm enters its read-side critical section just before reading the head pointer, and exits after processing the last list item.
At any point during the critical section, we can't know precisely what part of the list is under observation (the first item? the last? who knows!). We do however know that mucking about with any part of the shared structure will lead to race conditions. We've traded accuracy for speed.
A quiescent state is (for readers) everywhere / everywhen that isn't a read-side critical section. When a thread enters a quiescent state, it is a guarantee that all previous operations on the shared data have completed, and it is therefore safe to go mucking about with the internals.
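On the reader side, the critical section brackets the whole traversal. This sketch borrows the real liburcu names (rcu_read_lock, rcu_read_unlock, rcu_dereference); struct node and process() are stand-ins of mine:
#include <urcu.h>   /* liburcu, the userspace RCU library */

struct node {
    int value;
    struct node *next;
};

void process(struct node *n);   /* hypothetical per-node work */

void reader(struct node **head)
{
    struct node *n;

    rcu_read_lock();    /* enter the read-side critical section */
    for (n = rcu_dereference(*head); n != NULL; n = rcu_dereference(n->next))
        process(n);     /* safe: nothing we can reach will be freed under us */
    rcu_read_unlock();  /* ...and back to a quiescent state */
}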
Closely related to quiescent states are grace periods. A grace period starts the moment we perform a destructive operation, and ends once each thread has been in a quiescent state at least once. At that point, it is provably safe to reclaim garbage.
In a picture:
We have five readers (\(R_1\)–\(R_5\)), each accessing a shared data structure via RCU semantics. Time proceeds from left to right.
At \(T_0\), an update operation is initiated, and a grace period begins. Readers \(R_1\) and \(R_5\) are concurrently in read-side critical sections, so each will have to enter a quiescent state before the grace period will end. At \(T_1\), reader \(R_1\) enters a quiescent state. Since \(R_5\) was already in a quiescent state, this ends the grace period.
Note that for the first grace period, neither \(R_2\), nor \(R_4\) have any bearing on the grace period. \(R_4\) ends before the grace period does, but both sections start after the destructive operation has been performed, so they are provably unable to see the removed node.
The second grace period, highlighted in blue above, stretches from time \(T_2\) – \(T_3\). When it starts, all readers except for \(R_1\) are in read-side critical sections. By the time \(R_1\) enters its critical section, the change has been completed, and it has no effect on when the grace period ends.
In fact, only threads that are in critical sections when a grace period starts can prolong the grace period. However, once an “involved” thread enters a quiescent state, it no longer holds any power over the grace period. Intuitively, this makes sense; a thread in a quiescent state has (by definition) stopped interacting with the internals of the shared data, and cannot possibly hold copies of any internal pointers.
You can also see this principle at work by looking at the boxes in the diagram. Go ahead, I'll wait.
Do you see what \(R_4\) is doing? It is very quickly waffling from quiescent state back into a critical section. It manages to bounce back and forth three times before our slow \(R_5\) reader quiesces. But the last four of \(R_4\)'s critical sections have no bearing on the grace period.
There's a bunch of interesting math and bit twiddling tricks involved in implementing RCU on a real machine. I hope to get to those in my next post, which delves into the nuts and bolts of implementing RCU in a real-world C program.
The best (and most academic) paper I've found so far on RCU is User-Level Implementations of Read-Copy Update (14pp). The author list includes Paul McKenney—you'll see his name a lot in the literature. There's also supplemental material (12pp) available.
The second-best paper is Read-Copy Update (22pp). It's a lot less generic, and geared specifically towards implementation inside of an operating system kernel; namely, Linux. (It was published at the Ottawa Linux Symposium, in 2001).
For a more accessible, Linux-specific treatment, including how it is used, kernel calling conventions, etc., check out this three part series by the fine folks over at LWN.
If you're in for a longer read, or just really like the subject matter, Paul McKenney has written volumes. You may want to check out his dissertation (380pp). His book, Is Parallel Programming Hard, And, If So, What Can You Do About It? (477pp), discusses RCU and a wealth of other parallel programming topics (it's also forkable on Github).
And if you're really, really into it, here's some further, further reading. You've been warned, and if this costs you the next few weekends of your life, it's not my problem. ^_^
Happy Hacking!
I spend a fair amount of my day interacting with git. I keep operational configurations (i.e. BOSH deployments) in git. I keep all of my code in git. Hell, this blog is even in git (as is the software that runs it).
In my consulting work, I've introduced git to hundreds of people and dozens of teams, and here's what I've learned:
You need to customize git just a little.
This is what my prompt looks like, halfway through writing this essay:
Everything before the ) is from a little utility called gitprompt. It's made up of the following pieces:
master - Current branch
24c0fa - Current HEAD commit
+1*2f4 - Working copy state
I find it useful to know, at a glance, what branch I'm on, and having the HEAD commit ID right there in the prompt has come in handy on more than one occasion - "Are you sure you've done a git pull? I'm at 72b147a on the fix-things branch..."
Above all else, I use that working copy state. It's really the output of a git status, condensed down to a series of single-character flags and numbers. In the above example, +1 indicates that one file has been staged for commit (via git add), the *2 shows me that two tracked files have unstaged changes, and the f4 means there are four new (untracked) files that are not gitignore'd.
This often saves me from running a git status.
To use gitprompt, you'll have to download a copy to ~/bin, and add the following to your ~/.bashrc:
type git >/dev/null 2>&1
if [[ $? == 0 ]]; then
export PS0="%{%[\e[1;34m%]%b%[\e[00m%]:%[\e[1;33m%]%i%[\e[00m%]%}%{%[\e[1;31m%]%c%u%f%t%[\e[00m%]) %}$PS1 "
export PROMPT_COMMAND='export PS1=$($HOME/bin/gitprompt c=\+ u=\* statuscount=1)'
fi
(PROMPT_COMMAND is run before every prompt; in this configuration, gitprompt consumes the PS0 environment variable, interprets the parts between %{...}% and then sets PS1 accordingly. Neat, huh?)
Git itself is surprisingly easy to configure. While you can run the git-config ... command to set everything, it's easier to just edit your ~/.gitconfig by hand. Here's (part of) mine:
[user]
name = James Hunt
email = ...
[core]
excludesFile = ~/.gitignore
[push]
default = simple
[color]
ui = auto
diff = auto
[color "diff"]
new = green bold
old = red bold
meta = white
func = magenta bold
frag = yellow bold
whitespace = blue reverse
[color "status"]
added = green bold
changed = red bold
untracked = cyan bold
The [user] section is all about you. This is where you set your display name and email address, as they will appear in commit messages you author.
The [core] section customizes the core behavior of git itself. I use a global .gitignore file for common things that I know I never want committed, like object code files (.o), backup files (~ and *.bak), etc.
The [push] section manages how git push behaves. Setting the default push strategy to simple accepts the default behavior of Git 2.0+, is the safest strategy that causes the least amount of mayhem, and is generally suited to all workflows. See git-config(1) for more details.
The [color*] sections govern terminal colorization. I am a visual person, and I like to see things in terms of color as well as text. Out of the box, commands like git diff and git status don't take advantage of modern terminal emulators and print everything in the default colors.
I should point out that I use white foreground text on a black background, and my color choices arise from that aesthetic. If you prefer black-on-white, you probably want to pick a different color scheme.
Here's what git status looks like, all colorized:
At a glance, I can tell if there's anything changed (in red), anything to commit (in green) and anything new (light blue).
My ~/.gitconfig also contains my git aliases. These aliases result from years of using git, and have been lovingly crafted and tweaked over thousands if not millions of invocations of git.
[alias]
st = status
ci = commit
br = branch
co = checkout
df = diff
dfc = diff --cached
dfp = diff --stat -p
who = shortlog -s --
lg = log -n 20 --graph --pretty=format:'%Cred%h%Creset -%C(yellow)%d%Creset %s %Cgreen(%cr)%Creset %an' --abbrev-commit --date=relative
The first five are simple elisions for the common set of commands. I run git status far too often to have to type those additional four keystrokes.
NOTE: I have met people who define even shorter shell aliases, like gs='git status' or gl='git log'. I work with people who have aliases like gcam='git commit -a -m'. I don't do that because it obscures the fact that the command being called is in fact git. Do what works for you.
Looking at diffs is a fundamental git thing, so I give git diff its own alias. I also define two variations on the base diff: git dfc and git dfp.
git dfc shows the difference between what's been staged and what's been committed. I use this regularly when crafting commit messages, to make sure that what I think I'm committing is what I'm actually committing, and that the changes staged are coherent and related.
git dfp is a regular diff, but adds a small header listing files, the number of lines added / removed, and a visual (+/-) representation of each change.
Here's an example:
git who is something I use to get a feel for contributor share, i.e. who has the most code / commits and therefore may be best to approach with questions, thoughts and ideas. This isn't always correct — Go repositories in particular skew towards whoever introduces the most voluminous project dependencies, but it's a start.
Finally, git lg is my favorite. It's an elision of git log that saves exactly one character, until you realize that there's a ton of options in that alias definition. Those options make git lg unique enough in its own right. Here's an example:
Here's what I like most about git lg:
- The --graph rails show branching and merging at a glance.
- Abbreviated commit IDs and relative dates keep each commit to a single line, author included.
- Branch and tag decorations show me where everything points.
The great thing about all of these aliases is that they are partial. You can give them arguments and everything works as expected:
git df origin/branch-name
git ci that/stuff
git lg origin/master..HEAD
etc.
If you spend a measurable amount of your day interacting with git, you owe it to yourself to customize your environment to make life easier. I hope you can find some value in my configuration and aliases, but more than that, I hope you go out, read the man pages, play with the knobs and levers, and find something that works for you.
Happy Hacking!
When you do what I do, you write a lot of commit messages, but you read a lot more, and a lot of them are terrible.
It's easy to write bad commit messages. I get it. You've just found (and fixed!) a particularly nasty bug, and the last thing on your mind is re-hashing all that context you've still got bouncing around your head.
Or maybe you just made it through a truly harrowing deployment, fraught with rework, typos, and frustration. Who wants to revisit all of those failures? Just commit the manifests / configs / whatever and get on with your life.
The problem with bad commit messages is that they last. Some day, someone is going to be looking through some obscure corner of the git repository, wondering who made this change and why. That person may be you. And all you have to offer is...
commit e686205f97f47b45ccd2c0a575e4b5191134947a
Author: Past Me <hahaha@example.com>
Date: Mon Aug 15 10:54:43 2016 -0400
fix some stuff, i guess.
That's not what you want to see when you're in the middle of a git bisect call chasing down a bug, or trying to git blame a breaking change in search of a fix.
Every commit needs a good summary. It should be short, succinct and accurately describe the essence of the changeset. Try to keep it under 72 characters.
After the summary, leave a blank line, and then write a more in-depth and detailed description of what's actually in the commit. This is where context goes. Why are you making this change? Is there a ticket? Any other context that might be important to understanding the changes should be included.
On the flip side, don't pontificate. Over-describing is just as bad (and useless) as under-describing. Say what you need to say, and then close the editor.
For very small or completely self-explanatory commits, you can skip the details. This situation is more rare than you'd think.
I see lots of people spend hours in their editor of choice, only to cop out on the commit with git commit -m 'did stuff'.
Use your editor. git commit will invoke your $EDITOR by default, and let you write a more complete message.
$ git commit
Add UTF-8 String Functions
All standard str* functions now have u8_* variants that perform the
same operations, but properly handle multi-byte UTF-8 encodings of
Unicode code points. Call signatures are identical in all cases.
Fixes #1268
# Please enter the commit message for your changes. Lines starting
# with '#' will be ignored, and an empty message aborts the commit.
# On branch master
# Your branch is ahead of 'origin/master' by 3 commits.
# (use "git push" to publish your local commits)
#
# Changes to be committed:
# modified: include/str.h
# modified: src/strcase.c
# modified: src/strsearch.c
# modified: src/strmanip.c
# modified: README
#
# Untracked files:
# personal-notes
#
vim users get some additional help from a git-aware syntax highlighting mode that will visually separate your summary from the long description, and make it easier to see what's to be committed in this go-round.
If you exit out of the commit message editing session without saving (i.e. :q!), git will abort the commit. This has saved me a fair amount of grief over the years when I realize I'm committing multiple changesets in one commit — something you ought to try to avoid.
Back in January, I joined Massdrop, bought an Infinity Ergo Dox split keyboard, and got hooked on all things mechanical keyboards.
I used the default layout for a few days before I started experimenting with custom layouts. What follow are the notes I'm leaving for myself. If you have stumbled across this page and find it useful, great! If not, there is a vibrant (and helpful!) community of keyboard enthusiasts on Reddit, Geekhack and Deskthority.
As far as I can see, there are two configurators: one from Massdrop and another from Input Club. They both seem to do a decent job of letting you graphically configure your keyboard layout, and I've never had issues using the firmware generated by either.
I personally prefer I:C's configurator, because it shows you a single layout graphic, with all of the layers super-imposed (see above), rather than repeating the diagram once per layer.
The Ergo Dox derives much of its power and flexibility from layering. Why have just one set of key mappings when we have a programmable microcontroller? Layers give us lots (up to 8!) of virtual keyboard mappings (buttons → keycodes), activated through firmware-intercepted keypress events.
Layers can be stacked, and keypresses "fall-through" to lower layers until they hit a mapping configuration. This makes it trivial to overlay your main layout with a partial layout. For instance, you can set up a layer with just the FN keys in place of the top-row number keys, and your QWERTY keys act like normal.
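Here's how I picture that fall-through lookup working, sketched in C. This is illustrative only, not actual Input Club firmware code:
#include <stdbool.h>
#include <stdint.h>

#define LAYERS 8
#define KEYS   96

static uint16_t keymap[LAYERS][KEYS];  /* 0 = transparent (fall through) */
static bool     active[LAYERS];        /* which layers are currently on  */

uint16_t resolve(int key)
{
    int l;

    /* highest active layer wins; transparent slots defer downward */
    for (l = LAYERS - 1; l > 0; l--)
        if (active[l] && keymap[l][key] != 0)
            return keymap[l][key];

    return keymap[0][key];             /* layer 0 is the base QWERTY map */
}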
Neat. So how do you activate a layer? Via one of these: ƒ[num], LOCK-[num], LATCH-[num], NEXT-ƒ, or PREV-ƒ.
Hrm. That's a lot of options, each with its own peculiar behavior.
ƒ[num] is a temporary modifier, like Ctrl, Alt or Shift. You hold it down, and the keyboard activates that layer for as long as you keep holding it down. If you wanted to have a "math symbols" layer, but never really needed to type a string of symbols in a row, you can use ƒ[num].
LOCK-[num] is the CapsLock to ƒ[num]'s Shift: it toggles the activation of the given layer on or off. Lots of people use LOCK-[num] to switch to a superimposed numpad (tenkey) or even to flip QWERTY → COLEMAK or QWERTY → DVORAK.
LATCH-[num] doesn't really have an analog on other keyboards. It sits in the middle-ground between ƒ[num]'s temporary activation and LOCK-[num]'s toggling behavior — it activates the layer until a keypress is seen. The mapping will be consulted with the configured layer active, and then that layer will be deactivated. You can think of it as a sort of disconnected ƒ[num], where you don't have to hold down the activator key to use the layer.
NEXT-ƒ (and its sister PREV-ƒ) cycle the active layer through the available layers. I'm still a bit fuzzy on this one, as I haven't had the time to truly play with it. I've taken to assigning these to the bottom- and outer-most keys on either side (usually set to LGUI and RGUI) and it's nice. If you've got mappings defined on a layer, NEXT-ƒ and PREV-ƒ happily cycle through the layers (1-2-3-etc.) and the LCD HUD changes color and displays the layer number.
Seriously. Without the FLASH key (preferably on both halves), you have to resort to disassembling the case to access the Teensy flash pushbutton on the bottom of the PCB. That's a serious PITA, so just configure a FLASH on each side, k? You can put it in its own layer if you're worried about key real estate...
The actual process of flashing the firmware is pretty straightforward:
$ brew install dfu-util
$ dfu-util -D /path/to/firmware.bin
You're supposed to plug each half in directly and flash; the I:C configurator should provide both left- and right-hand side firmware images.
I'd love to be able to do this. I probably have to dig deeper into the firmware world, and stop relying on the configurators so much.
Module systems provide three things to a programming environment:
1. Code reuse
2. Encapsulation
3. Namespacing
In this essay, I expound on these topics, and elucidate my own thoughts (as both a professional programmer and aspiring language designer) on what constitutes good module system design.
Consider the following function, which implements a reservoir sampling algorithm:
(fn reservoir (ll i value)
(let (r (rand 0 1)
l (len ll))
(if (< i l)
(set (nth ll i) value)
(if (< r (/ l i))
(set (nth ll (rand 0 l)) value)))))
Ideally, we'd like to not have to copy that code from program to program (especially since it has a bug that we eventually need to fix). Instead, we'd rather do something like this:
(use stats/sample)
(fn main (args)
(let (ll '())
(each (i v args)
(reservoir ll i v))
(printf "R = [%v]\n" ll)))
A good module system lets you do that. Later, when we find that bug I alluded to earlier, we can fix the module and distribute a new version. Calling programs benefit when they update to that version, without having to patch the bug directly.
Encapsulation is very important. The module system must let module authors "hold back" some functionality as pertinent only to the implementation of the module, not its interface. Simply put, the interface is what the outside world sees; the implementation is how the job gets done.
Consider if we refactor our function a little, by introducing a small utility to govern whether or not a new sample replaces an older one:
(module stats/sample)
(fn replace? (l i)
(< (rand 0 1) (/ l i)))
(fn reservoir (ll i value)
(let (l (len ll))
(if (< i l)
(set (nth ll i) value)
(if (replace? l i)
(set (nth ll (rand 0 l)) value)))))
Now we have a problem. Our new (replace?) function is an implementation detail of our (reservoir) function, and we don't want callers to use it directly.
For Snook, the completely 100% hypothetical Lisp-like language I made up for this essay, we can introduce a decorator form that signals to the compiler that a given function ought not be seen by callers:
(module stats/sample)
(private
(fn replace? (l i)
(< (rand 0 1) (/ l i))))
;; ... etc. ...
With (replace?) marked private, any attempt to call it directly from programs or other modules will result in an error that the method is not defined.
Assume for the sake of illustration that we are writing a sensor reading collection program for a water treatment facility, in Snook. Because it's 2016, we're using an HTTP-enabled SCADA system for managing our treatment facility data. So we write up the following (inside of its own module, of course):
(module scada)
(use net/http)
(fn reservoir (ip username password)
(connect
(format "https://%s:81234" ip) ;; address of HTTP server
(basic-auth username password))) ;; HTTP BasicAuth header
(fn metric (endpoint name)
(let (r (get endpoint
(format "/v1/metric/%s" name)))
(if (= 200 (status-code r))
(body r)
(panic "request failed"))))
That neatly encapsulates how we connect to one of the reservoirs via its SCADA endpoint, and how we pull metrics out of it. However, we have set ourselves up for almost certain failure.
What happens when we try to use these two modules together, à la:
(use scada) ;; defines (reservoir)
(use stats/sample) ;; ... also defines (reservoir)
Chaos. Havoc. Uncertainty.
A good module system mandates namespacing to sidestep this issue entirely. Namespaces disambiguate which function, in which module you are interested in calling. By mandating it, module authors don't have to be as vigilant about name reuse.
Snook does it by prefixing the imported symbols (the functions) with the full name of the module.
(use scada)
(use stats/sample)

(fn main (args)
  (let (metrics (listof 10 nil)
        ip      (nth args 0)
        user    (nth args 1)
        pass    (nth args 2)
        resv    (scada/reservoir ip user pass))
;; collect lots of metrics (keep 10 samples)
(repeat (i 2000)
(let (v (scada/metric resv "pressure"))
(stats/sample/reservoir metrics i v)))
;; print the 10 samples we kept
(printf "[%s] temp: %v\n" ip metrics)))
A slight improvement on this scheme is to allow the programmer to specify their own namespace, to be used for the scope of the file:
(use scada)
(use stats/sample st)

(fn main (args)
  (let (metrics (listof 10 nil)
        ip      (nth args 0)
        user    (nth args 1)
        pass    (nth args 2)
        resv    (scada/reservoir ip user pass))
;; collect lots of metrics (keep 10 samples)
(repeat (i 2000)
(let (v (scada/metric resv "pressure"))
(st/reservoir metrics i v))) ;; simpler!
;; print the 10 samples we kept
(printf "[%s] temp: %v\n" ip metrics)))
(of course, this kind of namespace feature means we have to rewrite our scada module so that it uses the net/http/ prefix appropriately.)
Here are elaborations on some notes I took while thinking about module systems.
The (use ...) construct augments the calling environment by defining new symbols (using the prefix notation) for all exported functions — that is, those that have not been explicitly marked as private. Similar provisions can be made for variable bindings and constants.
To safeguard the integrity of modules, their bindings are fixed at compilation, and cannot be monkey-patched at runtime. To safeguard programmer sanity, Snook forbids rebinding of symbols imported via a (use ...) construct. If it were to allow it, it would only be a shadowing rebinding; it would have absolutely no bearing on the original module.
A side-effect of this decision is that exported variables are effectively read-only. This is good, since module-level variables are usually abused as a form of "acceptable" global variables. Module-level constants are unaffected by the no-rebinding rule.
Dependency tracking and resolution, while not explicitly part of the module system proper, is important to the utility and viability of the module system — indeed the language itself. If no one can find, or reliably source a module, what good is the module system?
I have several thoughts on this, that I will be committing to prose before long.
One of the design goals of Snook (keeping in mind that it is entirely fictional) is to facilitate small, self-contained, static executable binaries for a variety of target processors. The module system must support this endeavor by intelligently allowing unused functions to be skipped during compilation and assembly. This leads to smaller, more trim binaries.
My original thought on this was to introduce an additional level of segmentation below the module level: the unit. A module is composed of one or more units, each of which is a self-contained aspect of the module. Tightly-integrated modules would have fewer units, looser modules, more.
The more I thought about it, the more I realized that explicit (not to mention manual) segmentation of a module into units would be awkward and unwieldy for a module author. Are units allowed to call functions in other units of the same module? Doesn't that make them part of the same unit?
Maybe we can shift the burden of segmentation to the compiler...
Given the static call graph of a module, unconnected networks are the units! Compile each segment separately, caching them to speed up future compilation, and then at link time, just link in what you need, based on the program's call graph.
It works in theory, at least.
Module systems are important. As a programmer working in a language, more than half of my time is spent finding, learning and using modules (versus taking advantage of specific language features). To design a good language in the modern era, you have to design a good module system to go with it.
Happy Hacking!
LLVM is a compiler-writer's toolkit that purports to make it easy to implement new languages without getting bogged down in lots of machine-specific details like Instruction Set Architectures, or theoretic details like graph-coloring, register-spill algorithms and peephole optimizations.
I've had this dream of building a LISP dialect that compiles to machine code, and produces small and fast static binaries. For the first half of this year, I've mostly been reading (hence the low volume of updates despite what my New Year's Resolution promised). Having so immersed myself in the literature on language design and compiler construction, from the "purest" academic subjects of language formalisms to the in-the-trenches practice of assembler programming (yes, people still do that and it's awesome), I think I'm finally ready to start programming.
So, LLVM.
The thing most people talk about with LLVM is its Intermediate Representation. They talk about it so much that they just call it IR for short.
The IR provides a single common middle ground that all the front-end compiler modules generate. The back-end modules consume the IR and emit whatever the hell they want and/or need to. For example, the static compiler llc can read in IR and generate an executable machine code image for your specific processor.
Front-ends are on the left, back-ends on the right.
The IR frees up the front-end module authors from having to worry about the specifics of the target back-end. Likewise, back-end module authors, who may only be well-versed in the performance / optimization constraints of a specific processor family, don't need to worry about source-language concerns.
Using an intermediate representation also allows pipelining of modules that both consume and emit IR. Peephole optimization is a particularly lucid example. Consider the following assembler snippet:
movl %ebx, %eax ; put the contents of EBX into EAX
movl $0, %eax ; put the literal value 0 into EAX
No human would ever write this, but a generative compiler might, especially if these two statements saddle the gap between two generated templates. A peephole optimizer's job is to look at the above and determine that the first instruction is useless. Why should the machine bother transferring the contents of the EBX register into the EAX register, only to immediately overwrite that value with 0?
Really then, the above diagram looks more like this:
This architecture makes it seemingly trivial to add new front-ends, back-ends and optimization passes, and the language design and compiler-writing community has risen to the challenge.
LLVM can translate C/C++, Haskell, Swift, Ada, D, Delphi, Fortran and Objective-C.
It can target object and machine code for x86, ARM, PowerPC, MIPS, SPARC, and the IBM zSeries of mainframes.
The optimization stages (called passes) are even more impressive. Here's just a few:
Onward
Where to next?
For the motivated reader, here's some additional resources to satisfy your curiosity:
For my part, I'll be attempting to write follow-up essays in this series, covering the following topics:
I've been working on a transparent SQL router for PostgreSQL, something a bit like pgpool, but without all the "extra" functionality (no connection pooling, limited authentication intervention, etc.)
One of the key features of this new proxy is its transparency. Clients connecting to the router should be completely unaware that they are in fact talking to some piece of middleware that sits in front of PostgreSQL. They talk native FE/BE protocol, and the router responds in kind.
The documentation for PostgreSQL is normally top-shelf, but Chapter 46, which deals with the protocol, is surprisingly lacking in quality and confidence. It seems more like a design document than an API spec. So, armed with the information in there, I set out to probe a live PostgreSQL instance and see what it was going to spit back at me.
The PostgreSQL FE/BE protocol is a binary protocol that features lots of bits of text. SQL, after all, is a very ASCII-friendly language. As usual, when dealing with binary files, I turned to my old friend, od.
od is a utility for dumping the contents of files into whatever format you want for analysis. Here are some basic use cases that have made my life easier over the years.
First, treat the file like it's got embedded ASCII:
$ od -a /some/file
0000000 nul nul nul % nul etx nul nul u s e r nul j a m
0000020 e s nul d a t a b a s e nul e x a m
0000040 p l e nul
0000044
Or, if you prefer, hexadecimal:
$ od -x /some/file
0000000 0000 2500 0300 0000 7375 7265 6a00 6d61
0000020 7365 6400 7461 6261 7361 0065 7865 6d61
0000040 6c70 0065
0000044
You can even bypass the canned formats and specify your own with -t:
$ od -t x1 /some/file
0000000 00 00 00 25 00 03 00 00 75 73 65 72 00 6a 61 6d
0000020 65 73 00 64 61 74 61 62 61 73 65 00 65 78 61 6d
0000040 70 6c 65 00
0000044
Now let's get back to my PostgreSQL story, and let's get really fancy.
Every PostgreSQL client connection starts out by sending a StartupMessage to the backend server. This startup identifies the version of the FE/BE protocol that the client understands, and contains some basic startup parameters. Each StartupMessage starts with a 4-octet length field, followed by a 4-octet version field. The most significant 16 bits of the version field encode the major version, and the least significant 16 bits represent the minor version. All values are interpreted in network byte-order.
Next come zero or more pairs of strings, each represented as a sequence of ASCII characters terminated by a NULL byte. To make testing easier and more reproducible, I wrote a small Perl script to generate the wire-protocol representation via Perl's excellent pack function. Here's the gist:
my $s = '';
$s .= pack('nn', 3, 0); # v3.0
$s .= pack('A*xA*x', "user", "james");       # NUL-terminated pair
$s .= pack('A*xA*x', "database", "example"); # NUL-terminated pair
print pack('N', 4 + length($s)) . "$s";
I can use od to verify the pack, and inspect the binary structure of the generated StartupMessage:
$ ./pg | od -a
0000000 nul nul nul $ nul etx nul nul u s e r nul j a m
0000020 e s nul d a t a b a s e nul e x a m
0000040 p l e nul
0000044
It's kind of annoying that the 4-byte length field (bytes 0-3) displays as nul nul nul $. Sure, I could go look up the ASCII value of '$' (it's 36). Or, I could just chain some -t format specifiers together. I eventually ended up with:
$ ./pg | od -ta -tx1 -to4 --endian=big
0000000 nul nul nul $ nul etx nul nul u s e r nul j a m
00 00 00 24 00 03 00 00 75 73 65 72 00 6a 61 6d
00000000044 00000600000 16534662562 00032460555
0000020 e s nul d a t a b a s e nul e x a m
65 73 00 64 61 74 61 62 61 73 65 00 65 78 61 6d
14534600144 14135060542 14134662400 14536060555
0000040 p l e nul
70 6c 65 00
16033062400
0000044
Excellent. Now I can easily verify that the length field is 044, which matches up with the length of the input text there at the bottom (also 044).
Here's the play-by-play:
-ta prints each octet as its named ASCII character.
-tx1 prints each octet in hexadecimal, one byte at a time.
-to4 prints 4-byte words in octal.
--endian=big tells od to use big-endian or network byte order when interpreting multi-octet sequences like our 4-byte length field.
StartupMessage looks good, so let's run it through a live PostgreSQL node!
$ ./pg | nc 10.244.232.2 6432 | od -ta -tx1 -to4 --endian=big
0000000 E nul nul nul ~ S F A T A L nul C 0 8 P
45 00 00 00 7e 53 46 41 54 41 4c 00 43 30 38 50
10500000000 17624643101 12420246000 10314034120
0000020 0 1 nul M i n v a l i d sp s t a r
30 31 00 4d 69 6e 76 61 6c 69 64 20 73 74 61 72
06014200115 15133473141 15432262040 16335060562
0000040 t u p sp p a c k e t sp l a y o u
74 75 70 20 70 61 63 6b 65 74 20 6c 61 79 6f 75
16435270040 16030261553 14535020154 14136267565
0000060 t : sp e x p e c t e d sp t e r m
74 3a 20 65 78 70 65 63 74 65 64 20 74 65 72 6d
16416420145 17034062543 16431262040 16431271155
0000100 i n a t o r sp a s sp l a s t sp b
69 6e 61 74 6f 72 20 61 73 20 6c 61 73 74 20 62
15133460564 15734420141 16310066141 16335020142
0000120 y t e nul F p o s t m a s t e r .
79 74 65 00 46 70 6f 73 74 6d 61 73 74 65 72 2e
17135062400 10634067563 16433260563 16431271056
0000140 c nul L 2 0 7 3 nul R P r o c e s s
63 00 4c 32 30 37 33 00 52 50 72 6f 63 65 73 73
14300046062 06015631400 12224071157 14331271563
0000160 S t a r t u p P a c k e t nul nul
53 74 61 72 74 75 70 50 61 63 6b 65 74 00 00
12335060562 16435270120 14130665545 16400000000
0000177
Oh noes! PostgreSQL is definitely not happy with our StartupMessage packet. It returns an ErrorResponse message letting us know why: invalid startup packet layout: expected terminator as last byte. Unlike the message we sent, the ErrorResponse reply message is typed; the first octet is an ASCII character that identifies what type of message it is. E tells us we are dealing with an ErrorResponse.
Unfortunately, that single leading octet throws off our -to4, leading to the laughably high message length of 10500000000 (about 1.2GB). This is what --skip-bytes (-j to her friends) was made for! We can skip the type byte, since we now know that it is an 'E', and re-align on our 4-byte boundary:
$ ./pg | nc 10.244.232.2 6432 | od -ta -tx1 -to4 --endian=big -j1
0000001 nul nul nul ~ S F A T A L nul C 0 8 P 0
00 00 00 7e 53 46 41 54 41 4c 00 43 30 38 50 30
00000000176 12321440524 10123000103 06016050060
0000021 1 nul M i n v a l i d sp s t a r t
31 00 4d 69 6e 76 61 6c 69 64 20 73 74 61 72 74
06100046551 15635460554 15131020163 16430271164
0000041 u p sp p a c k e t sp l a y o u t
75 70 20 70 61 63 6b 65 74 20 6c 61 79 6f 75 74
16534020160 14130665545 16410066141 17133672564
0000061 : sp e x p e c t e d sp t e r m i
3a 20 65 78 70 65 63 74 65 64 20 74 65 72 6d 69
07210062570 16031261564 14531020164 14534466551
0000101 n a t o r sp a s sp l a s t sp b y
6e 61 74 6f 72 20 61 73 20 6c 61 73 74 20 62 79
15630272157 16210060563 04033060563 16410061171
0000121 t e nul F p o s t m a s t e r . c
74 65 00 46 70 6f 73 74 6d 61 73 74 65 72 2e 63
16431200106 16033671564 15530271564 14534427143
0000141 nul L 2 0 7 3 nul R P r o c e s s S
00 4c 32 30 37 33 00 52 50 72 6f 63 65 73 73 53
00023031060 06714600122 12034467543 14534671523
0000161 t a r t u p P a c k e t nul nul
74 61 72 74 75 70 50 61 63 6b 65 74 00 00
16430271164 16534050141 14332662564 00000000000
0000177
Everything seems to be as expected. A length of 0176 (which matches our 0177 final address, if you remember the skipped octet). We can also see the structure that the PostgreSQL manual says we ought to see: an 'S' field, a 'C' field, a few of the optional fields, and NULL-terminated string payloads for each.
If you look closely, you'll notice that the server's message is itself terminated by a NULL \0 octet. Is this what the ErrorResponse is complaining about?
Yes!
By adding a final NULL terminator to the Perl script:
my $s = '';
$s .= pack('nn', 3, 0); # v3.0
$s .= pack('A*xA*x', "user", "james");       # "user\0james\0"
$s .= pack('A*xA*x', "database", "example"); # "database\0example\0"
print pack('N', 4 + length($s)) . "$s\0"; # NULL!
We can now elicit a Ready response from the PostgreSQL backend:
$ ./pg | od -ta -tx1 -to4 --endian=big
0000000 nul nul nul % nul etx nul nul u s e r nul j a m
00 00 00 25 00 03 00 00 75 73 65 72 00 6a 61 6d
00000000045 00000600000 16534662562 00032460555
0000020 e s nul d a t a b a s e nul e x a m
65 73 00 64 61 74 61 62 61 73 65 00 65 78 61 6d
14534600144 14135060542 14134662400 14536060555
0000040 p l e nul nul
70 6c 65 00 00
16033062400 00000000000
0000045
$ ./pg | nc 10.244.232.2 6432 | od -ta -tx1 -to4 --endian=big
0000000 R nul nul nul ff nul nul nul enq L d ' `
52 00 00 00 0c 00 00 00 05 4c 64 a7 60
12200000000 01400000000 00523062247 14000000000
0000015
Success! An authentication request (note the R)! The code field of 5 means PostgreSQL wants an MD5-hashed password, and the final four octets are the salt to hash it with. Our StartupMessage finally passes muster.
od is a small but flexible tool for analyzing weird files, binary network protocols, and even text-based network data (HTTP, anyone?) to make sure it doesn't have any weird or unexpected octets in it. It's one of the many tools I use on a daily basis, both in writing code and in administering systems.
Happy Hacking!
]]>I got the wife an iPad Pro and an Apple Pencil for Christmas, since she's an artist, and artists need good tools. Sometimes she lets me use it for my silly art: diagrams.
I work in a text editor all day. I write lots of code, in lots of languages. Most everything I do is via a keyboard. But damn does it feel good to draw again. The tactile sensation of using a pencil to make marks on a page (even if that page is glassy and the pencil is smooth) is remarkable (ha!).
Having layers, millions of colors, tons of brushes and (most importantly) an undo button? Priceless.
]]>If you're running an Open Source project using a Copyleft license like the GNU General Public License, its weaker cousin the LGPL, or even a variation on the Apache License, you should probably be getting Contributor License Agreements from everyone submitting code for inclusion.
Why?
Without a CLA, or some other legal contract, you technically have no right to redistribute anyone else's changes. There's a big debate online about whether CLAs are too heavy-handed, whether they provide any real protection, and if they drive away would-be contributors.
I want to side-step the socio-legal issues for now, and just focus on using the Travis CI tool for enforcing CLAs whenever a Github Pull Request is submitted.
The concept is pretty straightforward: maintain a list of identities (read: email addresses) of people who have signed and submitted CLAs. Keep that file in-repo, and teach the CI build script to check the new commits it's testing against that list.
Simple!
Here's the format of the CONTRIBUTORS file:
# comments start with an octothorpe
# and run until the end of the line
# ^^ blank lines are ignored.
# the normal case - one person, one email
Felicia Adkins <felicia@example.com>
# two emails can be supplied if necessary
Mike Nash <m@example.com> <mike.nash@example.com>
When someone signs a new CLA, add their name to the list!
With Travis, we can use the TRAVIS_COMMIT_RANGE environment variable, which lists the range of commits (SHA1 IDs) that were included in the git push or Github Pull Request. The idea is this:

1. Get the list of commits in TRAVIS_COMMIT_RANGE
2. Look up each commit author (by email address) in the CONTRIBUTORS file
3. If an author isn't in CONTRIBUTORS, fail (but defer the failure until every commit has been checked)

Here's a Bash function you can add to your CI script:
check_cla() {
local passchar="\xe2\x9c\x94" # U+2714 - ballot check
local failchar="\xe2\x9c\x98" # U+2718 - ballot x
local rc=0
local IFS=$'\n'
echo "Checking CONTRIBUTOR status..."
for x in $(git log --pretty=format:'%aE %h - %s (%aN <%aE>)' \
${TRAVIS_COMMIT_RANGE}); do
email=${x%% *}
desc=${x#* }
if grep -q '^[^#].*<'${email}'>' CONTRIBUTORS; then
echo -e "\033[32m${passchar}\033[0m $desc"
else
echo -e "\033[31m${failchar}\033[0m $desc"
echo -e " \033[31m<${email}> not listed in CONTRIBUTORS file!\033[0m"
rc=1
fi
done
echo
return $rc
}
When you call it, it will print one line per commit, prefixed with a green check mark or a red cross. If it finds any commit authors who aren't listed (again, by email address) in the CONTRIBUTORS file, it will print a message to that effect and ultimately return non-zero. With set -e enabled, this will fail your build. Otherwise, you can just check $? or wrap the call in an if block.
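For example, at the end of your CI script (a sketch; adapt to taste):

if ! check_cla; then
  echo >&2 "one or more commit authors still need to sign a CLA"
  exit 1
fi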
Happy Hacking!
]]>Did you know that in a group of only 23 people, there's a 50% chance that at least two of those people share the exact same birthday?
Yup. It's called The Birthday Paradox, and it's got some interesting math behind it. Surprisingly, it has some pretty far-reaching implications for cryptography, and highlights a fundamental thing I've noticed about crypto: your intuition is usually wrong.
Let's flip the problem on its head to make the math a little easier. Instead of calculating the probability that people are sharing birthdays, let's consider the probability that no one shares a birthday. Together, these two quantities must add up to 1, since there are no other possible outcomes - either we have collisions, or we don't.
We'll call the probability of unique birthdays \(P\). Let's also ignore February 29th (sorry leap-year kids), and assume that we have a uniform distribution (that is, no dates are special). This gives us a total of 365 possible values, each one equally likely.
If we have a group with only one person in it, we have all 365 possible dates to choose from, so \(P = 1\).
$$ P_1 = 365 / 365 = 1 $$
If we add a person, there are now 364 unique birthdays left, so the problem becomes one of selection:
$$ P_2 = {365 - 1} / 365 = 0.9973 $$
There's a 99.73% chance these two people won't share a birthday.
Incidentally (although it's easy to miss in a cursory glance), we can calculate the total probability \(P\) by multiplying the two independent event probabilities we have so far:
$$ P = P_2 P_1 = ({365 - 1} / 365) (365 / 365) = 0.9973 $$
If we add a third person, we not only have to pick a unique birthday for the second person (\(P_2 = {365-1}/365\)), we now have to pick a unique date from the remaining birthdays for the third person, or:
$$ P_3 = {365 - 2} / 365 = 0.9945 $$
Again, we can calculate the total probability \(P\) from the independent event probabilities:
$$ P = P_3 P_2 P_1 $$
$$ P = ({365 - 2} / 365) ({365 - 1} / 365) (365 / 365) $$
$$ P = 0.9945 × 0.9973 × 1 = 0.9918 $$
Since the probabilities of each event after the first are less than 1, the compound probability is less than either independent event. This is where we start to see that the intuition is wrong, because most people think in terms of the individual events.
In a more general form, for a set of \(N\) unique values, selecting \(k\) of them leads to a probability \(P\) that all values are unique of:
$$ P = ({N - 1} / N) ({N - 2} / N) … ({N - (k - 1)} / N) $$
There will be \(k-1\) terms in this equation, so we can simplify it thusly:
$$ P = {(N - 1) (N - 2) … (N - (k - 1))} / N^{k-1} $$
Introducing an \(N\) term to both numerator and denominator, we start to approach \(N!\):
$$ P = (N / N) ({(N - 1) (N - 2) … (N - (k - 1))} / N^{k-1}) $$
Then the denominator simplifies to \(N^{k-1+1}\), which is just \(N^k\):
$$ P = {N (N - 1) (N - 2) … (N - (k - 1))} / N^k $$
Further simplification can be had by introducing the remainder of the factorial into both the numerator and denominator:
$$ P = {N (N - 1) (N - 2) … (N - (k - 1)) (N - k)! } / {N^k (N - k)!} $$
This allows us to collapse the numerator to just \(N!\):
$$ P = N! / {N^k (N - k)!} $$
That is the final (factorial) form of what we'll call The Birthday Bound.
Armed with our formula for calculating the probability of no collisions, we can easily figure out the probability we actually care about:
$$ P_c = 1 - N! / {N^k (N - k)!} $$
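We can sanity-check the famous 23-people claim straight from the shell, computing the product form term by term (which neatly sidesteps the factorials):

$ awk 'BEGIN {
    N = 365; k = 23; P = 1
    for (i = 0; i < k; i++) P *= (N - i) / N
    printf "P(all unique) = %0.4f; P(collision) = %0.4f\n", P, 1 - P
  }'
P(all unique) = 0.4927; P(collision) = 0.5073

Just over 50%, as promised.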
For the sizes that matter in crypto, though, that's a lot to calculate, even with a computer. If you want to extend this topic to cover hash collisions (and I most certainly do), you have to start looking at rather large n-bit values for \(N\), like \(2^{64}\), or \(2^{192}\) (for XSalsa20). I really don't want to have to calculate \(2^{192}!\). There may not even be enough time before the heat death of the Universe to grind on that calculation, even with our fastest supercomputers.
The general rule of thumb is that you will start seeing hash collisions at \(2^{n/2}\) uniformly distributed random values in the range \([0, 2^n)\). The math behind the approximation seems to involve Taylor Series approximations and the natural logarithm. Hopefully I can dust off the old college Calc texts and re-acquaint myself with those topics and bring you a more in-depth treatment next time!
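For the impatient, here is the key step of that approximation: since \(1 - x ≈ e^{-x}\) for small \(x\) (the first terms of its Taylor series), each factor \({N - i} / N = 1 - i/N\) is approximately \(e^{-i/N}\), so:

$$ P ≈ e^{-1/N} × e^{-2/N} × … × e^{-(k-1)/N} = e^{-{k(k-1)} / {2N}} $$

Setting \(P = 1/2\) and solving for \(k\) gives \(k ≈ 1.18 √N\); with \(N = 2^n\) possible values, that is exactly the \(2^{n/2}\) rule of thumb.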
Happy Hacking!
]]>BOSH makes a whole lot of tasks in the operations / systems management space way easier than ever before. Combine that with tools like Spruce and Genesis, and you have a really powerful paradigm for managing your deployments. Pair that with Concourse and it seems like the sky is the limit!
Then you run into the security problem.
In order for your Concourse pipeline to deploy your BOSH manifests, you need the manifests. No problem; just stick it in a git repository somewhere and you're good. Except that BOSH manifests are full of all kinds of sensitive information like database credentials, root passwords and IaaS secret keys.
Enter Vault, the secure credentials storage system from Hashicorp. Vault is a really slick piece of technology, and it would be great if we could just integrate it into our Concourse deployment pipelines, right?
Sure. Let's do that.
Genesis is a deployment paradigm for BOSH that builds on top of Spruce and comes with its own, embeddable, helper script. It sets up new deployment manifest repositories (in git) using a multi-tiered structure of increasingly more specific layers of BOSH manifest file overrides (global → site → environment).
You can read more about it here.
When it comes to Vault and secrets, Genesis Just Works™.
Genesis will detect that you are running with access to a Vault (by way of the VAULT_ADDR environment variable), and will behave accordingly. Notably, this involves generating two versions of your BOSH manifest: one with credentials in it, to be used for deploying; and one without credentials, suitable for committing to your upstream git repository.
It also features rudimentary Concourse pipeline support for doing basic deployments. This too takes advantage of the Vault integration to pull down secrets during deployment.
Before we can start experimenting with all these neat toys, we're going to need a spinning Vault instance. For our purposes we'll use BOSH and a self-contained in-memory storage backend.
Luckily, there is a BOSH release for Vault that we can use!
$ git clone https://github.com/cloudfoundry-community/vault-boshrelease
$ cd vault-boshrelease
$ bosh upload release releases/vault/vault-0.1.3.yml
Then, spin up a deployment. Go ahead. I'll wait.
Once Vault is up, you should take note of the IP that your vault VM is running on, by running bosh vms. Put that in an environment variable named VAULT_ADDR and export it.
$ export VAULT_ADDR=http://<VAULT-IP-ADDRESS>:8200
Then, you'll want to follow the Getting Started Guide on the Vault website, to unseal your vault and get access with the root token.
Vault is optimized for people. It provides a wealth of authentication tie-ins to systems like LDAP and Active Directory, so that organizations can enforce policy globally, and users have one fewer password to remember. This poses a bit of a problem for our situation, since we want a robot (our Concourse worker) to be able to securely authenticate to the vault without the assistance of a human.
The App-ID backend (documented here) provides us just that. We can configure two tokens, the user-id and the app-id, lock them down with a network ACL, and drop those two tokens into our pipeline configuration.
First up, we'll need to enable the app-id authentication method:
$ vault auth -methods
Path Type Description
token/ token token based credentials
$ vault auth -enable app-id
$ vault auth -methods
Path Type Description
app-id/ app-id
token/ token token based credentials
Next, we'll configure an app-id token, by writing to the correct backend path, like so:
$ vault write auth/app-id/map/app-id/testing-deployment-pipeline \
    value=root \
    display_name="Testing Deployments pipeline"
There's a lot going on here, but here's the highlights:

- The path auth/app-id/map/app-id/testing-deployment-pipeline is just how the Vault app-id backend needs to be configured. Everything but that last component is literal. The last part is the app-id token itself, in this case testing-deployment-pipeline.
- The value=root argument associates the new token with the named access policy (here: "root"). This governs what access is allowed for machines that successfully authenticate with this token.
- The display_name=... argument sets the name to be used in CLI output.

An app-id token is useless without an associated user-id token, so let's make one of those too:
$ vault write auth/app-id/map/user-id/concourse \
    value=testing-deployment-pipeline \
    cidr_block=10.244.0.0/16
That's a keyboard-ful. Breaking it down:

- concourse is the new user-id token we are creating, and the path ends at user-id instead of app-id.
- The value=... argument associates the user-id token with the app-id token we just created (testing-deployment-pipeline). In Vault parlance, the user ID is now mapped to the app-id, and the two tokens can be used together to authenticate.
- The cidr_block=... argument restricts where the user-id token can be used from. Any attempts to authenticate as "concourse" from anywhere outside of the 10.244.x.x network will fail.

We're also going to create a fixed secret, called "secret/handshake", that we will use later to validate authentication during automation runs:
$ vault write secret/handshake knock=knock
You can validate that the secret was saved (and see what the diagnostic output in our pipeline will look like) by running:
$ vault read secret/handshake
Key Value
lease_duration 2592000
knock knock
That's it. Vault is configured!
Configuring security is not a lightweight task, and definitely demands attention to detail and an appreciation for the subtleties of your environment, intended use cases and technology stack. To keep this already very long post short, I'm just using the default root policy that ships with Vault.
Don't do that in production
The root policy has full access to all secrets in all backends. In real environments, you will want to create a new policy with locked down and very specific access, and use that instead.
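A sketch of what a locked-down policy might look like (the policy name and secret paths here are invented for illustration, using the old-style vault policy-write command):

$ cat > testing-pipeline.hcl <<EOF
# read-only, and only the bits the pipeline actually needs
path "secret/handshake" {
  policy = "read"
}
path "secret/testing/*" {
  policy = "read"
}
EOF
$ vault policy-write testing-pipeline testing-pipeline.hcl

Then map the app-id token with value=testing-pipeline instead of value=root.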
You'll also notice that these commands pass secrets in the clear as command line arguments. These have the unfortunate side effect of showing up in the process table, in sudo logs, and in shell history files (for fun, grep vault ~/.*history).
Instead, you should be passing credentials via files, using the @/path/to/file invocation style. Make sure you chmod the files properly before you put your secrets in them!
$ mkdir ~/secrets
$ chmod 0700 ~/secrets
$ touch ~/secrets/key
$ chmod 0600 ~/secrets/key
$ vi ~/secrets/key
...
$ vault write secret/key/stuff @~/secrets/key
$ rm ~/secrets/key
And remember, those are JSON files.
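For example, a (made-up) key file might look like this:

$ cat ~/secrets/key
{"password": "sup3r-sekrit-squirrel"}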
I'm going to cheat here and lean heavily on Genesis for generating my pipelines. It creates all the necessary bits and pieces of configuration and scripts, and handles Vault for you.
Assuming you start with a genesis-managed deployment:
$ cd testing-deployments
$ genesis embed
...
$ genesis repipe
...
The genesis embed call stores a copy of your current genesis script in the top-level bin/ directory. The Concourse/Genesis integration pieces rely on this to avoid mixing versions.
When we call genesis repipe, Genesis looks at all of the Concourse configuration fragments in the ci/ directory and assembles them into a single cohesive configuration. (It also calls out to Vault if you set up the templates to pull in secrets like BOSH passwords.) It then takes the final configuration and uploads it to Concourse to configure the pipeline.
When all is said and done, we will have a new pipeline configured in our Concourse installation, all ready to go.
The Docker image that Concourse is going to spin needs the following utilities:
You actually get this for free if you use Genesis to generate your deployment, since it sets up a job to build a custom Docker image, as part of the pipeline itself.
Note: you'll need v1.0.1 of Spruce to use the (( vault ... )) operator, since it's not in v0.13.0 or below.
Magic!
When you push new commits to master, Concourse will take note and kick off the deployment. It spins up the Docker task image, and does the following:

1. Authenticates to the Vault at $VAULT_ADDR using the two tokens (which are themselves passed in as environment variables)
2. Runs vault status (This has the useful side-effect of providing diagnostics when / if the authentication fails)
3. Reads back the secret/handshake bit we set up as we were configuring Vault

If you skip that step, the automation will fail because it thinks that it has not successfully authenticated to Vault.
Once it can access the vault and retrieve secrets and credentials, the pipeline runs some bits of Genesis to combine all of those YAML templates together via Spruce.
When Spruce sees an operator like this:
---
meta:
credentials: (( vault "cloud/admin:password" ))
it does one of two things. In the clean manifest (the one without credentials) it replaces the operator with the literal string "REDACTED". This indicates that there is supposed to be sensitive information there, but that it has been hidden to allow the manifest to be committed to git. In the deployment manifest (the one with credentials) it contacts the vault and asks for the secret/cloud/admin secret, and extracts the 'password' key from that, replacing the operator with that value.
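Side by side, the same fragment in each output (the password value here is invented for illustration):

# clean manifest (committed to git)
meta:
  credentials: REDACTED

# deployment manifest (fed to bosh deploy)
meta:
  credentials: sup3rsekrit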
With the manifests (plural) generated, the pipeline moves onto the next step, and attempts to bosh deploy the deployment manifest (secrets and all). If that succeeds, it commits the clean manifest, destroys the deployment manifest and pushes the (scrubbed) changes back to origin.
The upshot of this little dance is that credentials live inside the secure confines of the vault, and are only exposed for a small window of time inside the executing Docker container. They do not get committed.
This is not a perfect solution.
It suffers from a few large gaps in protection. For starters, neither Concourse nor BOSH understand what parts of their configuration / manifests are sensitive, so they do not redact them. They do, however, make it possible to retrieve configuration:
$ bosh download manifest test-deployment
...
$ echo "--- {}" > empty.yml
$ echo n | fly set-pipeline -p test-deployment-pipelines -c empty.yml
...
This can, if not handled with appropriate caution, completely sidestep all of the protection of a Vault-enabled pipeline.
The bosh deploy command also prints a semantic diff highlighting changes being made to the deployment. If you have rotated passwords inside your vault, the new secrets will be printed in the clear on the next deploy. Concourse compounds this risk by making that output available as part of the job log in its web user interface.
Beyond that, anyone with direct access to the BOSH director, or any of the VMs inside of the BOSH deployments (especially the Concourse VMs) can access passwords that are rendered via job templates into files on-disk.
For these reasons, you must be careful to configure appropriate compensating controls in your environment. This boils down to restricting who can talk to the BOSH director, who can view or reconfigure your Concourse pipelines, and who can log into the deployed VMs.
Hopefully in the future, the BOSH and Concourse teams will turn their prodigious software engineering talents towards hardening these products to be more security-conscious. There's currently a pull request out to the Concourse team for the bosh-deployment-image resource that adds the --redact-diff option to the BOSH deploy command, to hide the diff output altogether.
Hopefully, you're all fired up about protecting sensitive credentials without losing the ability to automate your BOSH deployments via Concourse pipelines. To dig a little deeper, check out the Vault, Genesis, and Concourse documentation.
Happy Hacking!
]]>go get sure is neat, huh? Want some code, and know that it's hosted on Github? Just go get github.com/USER/REPO and you're good to go! But what if you're a go-getter (heh) and want some vanity in your life?
How about this?
$ go get jameshunt.us/bolo/core
It's incredibly easy to do. You'll need the following:

- a domain name you control
- a web server answering for that domain
- an SSL/TLS certificate (more on that at the end)
Let's say you want to be able to do this:
$ go get example.com/extras/fun
yet pull the code from the example/extras-fun repo on Github?
All you need to do is add the following <meta> tag, with the correct name and content attributes, to the page at http://example.com/extras/fun:
<meta name="go-import"
content="example.com/extras/fun git https://github.com/example/extras-fun">
That's it.
The content of the meta tag (the metadata itself) is three space-separated tokens:
The import prefix is the top-level of the module being imported. More on that later.
The vcs is a short string identifying which version control system needs to be used to pull down the repository code. Valid values are "git", and some others that I don't use.
The repository is the full URL to the repository to retrieve. In other words, our Github repository URL (in read-only mode).
If you have submodules (say extras/fun/times) that need to be go-gettable in their own right, you do the same thing. I mean, you add the exact same meta tag to the page for times:
<meta name="go-import"
content="example.com/extras/fun git https://github.com/example/extras-fun">
You have to do the first step though. Why? Given:

$ go get example.com/extras/fun/times

go get will do the following:

1. request https://example.com/extras/fun/times
2. look for a go-import <meta> tag in the response
3. determine the import prefix from that tag (here, example.com/extras/fun)
4. if the import prefix differs from the path it was asked to get:
   a. request https://example.com/extras/fun
   b. look for a go-import <meta> tag there
   c. verify that the two tags match

Note that step 4.c does its matching based on the previous import prefix (as determined in step 3)
At the outset, I mentioned needing an SSL/TLS certificate. Technically speaking, you don't need it, but if you don't encrypt your endpoint, go-getters will need the ugly -insecure flag. It's best just to avoid it, given that you can get a single-domain certificate for about $5/year.
Time to revisit the usual gripes with recruiter emails:
I suppose recruiting firms have a vested interest in prospective employees not bypassing them and applying directly to the company. If that's the case, the recruiting firm needs to nail down the exclusivity clause in the contract with their client.
Ambiguous descriptors like 'a bright software company' or 'industry leader' tell me nothing. "New Bay Area Startup" is about as descriptive as "that one company that does things"
Seriously. Who the &*$! is hiring?
Let's be honest, applying for a job and going through the interview process is work. It takes time, and can be nerve-wracking. Why would I go through all of that if the position pays less than I need or want?
This isn't about greed. It's about financial sense. If I currently make $100k, and you are offering me a job to relocate across the country, for $90k, chances are I can save both of us time by opting out early.
It states clearly on my résumé where I live. The recruiter knows where the job is. Somehow, I have to get from here to there, get out of a mortgage or a lease, pay for moving expenses and find a place to live.
Not even a mention of relocation expenses? Drop $10k off of that salary you won't tell me about.
I don't need the job the recruiter is offering. The recruiter needs to fill the position they have before they lose the contract.
That's their need, not mine.
]]>I hardly ever make New Year's resolutions, but here goes.
In 2016, I resolve to:
If those four pan out, 2017 may just be the year of the Lisp-that-compiles-to-native-code.
]]>From heavyweights like Visual Studio and Eclipse to more lightweight alternatives like Atom and CodeLite, Integrated Development Environments, or IDEs, continue to persist into the modern era as an "essential" tool of software development.
I must disagree.
Over the last two decades, I've written code in a variety of contexts, using different languages, libraries, and frameworks. I've written PHP, C, Java, Ruby, Perl, Common Lisp, Scheme, Python, Javascript, and more, building medium- to large-scale systems that run in production. I've developed for a bevy of old-school UNIX platforms, Linux, BSD, and the web.
And I have never found IDEs to be helpful to me, as a programmer.
When pressed, IDE advocates usually offer up the following bullet-points as benefits of working inside of an IDE:

- code completion
- code generation
- integrated build, debug, and version-control tooling
(I have left out the vague and unquantifiable reasons like "they help boost productivity" and "they makes things better", and the nonsensical responses like "how else do you write the codes?")
Each of these so-called benefits is in fact only beneficial to beginning and/or mediocre developers.
Given that the editor actually understands the code being written, and has a deep knowledge of the language and its faculties, code completion sounds like a great idea. If I start with the following snippet of C, for example:
#include <stdio.h>
int determine_mode(int n, char **args);
int main(int argc, char **argv)
{
int apples = 42;
/* next line intentionally incomplete... */
int mode = determine_mode(
return 0;
}
... and then type the character a after the opening parenthesis of the call to determine_mode, the IDE can helpfully pop up a list of possible variable names. In this case, apples or argc (argv is not a candidate because the first argument to the function must be an integer, not a character pointer).
Apply this to function invocation (auto-complete the function name) and object / struct member names (what are all those members called, anyway?!?) and you've got yourself a pretty slick and helpful tool for getting code written faster. After all, it takes a non-zero amount of time to press keys, so the fewer keystrokes it takes to get code, the better off you'll be, right?
Right?
Wrong.
Sure, popup-coding helps people unfamiliar with the code base, or the library, or even the language, but they will never move beyond the need for the code-completion crutch. I know (several) professional programmers out there who cannot write code without the editor telling them where to go next.
It's no secret that there's limited space in between your ears, and everything else is always vying for that precious real estate. When do I have to pick up the dry cleaning? What's left to do to meet that deadline? Why is this stupid race condition so hard to track down?
When you externalize bits of the library (what arguments does this function take? What is its return value?), or worse, bits of the language itself (how do for loops work? What's the syntax for a goroutine?), you push them out of your head and into the editor. Which is precisely the wrong thing to do when you want to excel at software development.
Instead, professional programmers should train their minds to be able to hold more information about the problem at hand in their head. Contrary to popular belief (and an unspoken foundation of IDE enthusiasm) you can strengthen your mind to remember more.
Some IDEs offer a productivity panacea that's even more potent (and alluring) than popup-coding: Code Generation.
It works like this: you fill out a form, click a few buttons, and -poof- scads of code appears with almost no effort. Pretty cool, eh?
In its more usable form as Automatic Programming, yes, code generation is pretty nifty. But most IDEs get it wrong in a few very important ways.
For starters, code generation is usually a one-time event. This is just glorified templating, scaffolding or, as I like to call it, boilerplate. Boilerplate is bad because requirements change over time, and the code so-generated must also change. Sadly, subsequent change to generated code is almost always manual, which obliterates any value gained from the initial productivity boost.
There are ways around this. You can write code that in turn writes more code. If you do that outside of the programming language it's called a domain-specific language or DSL. If you embed it into the language, it's a macro (Lisp-style, not C-style).
This brings me to my second gripe with IDE code generation tools: a good automatic programming facility should live outside of the editor, and become part of either the language itself (as a macro) or the codebase (as a DSL). The editor is the wrong place for it.
Code - compile - debug - commit, all from the comfort of your editor!
Actually, that does sound like a good idea. Too bad IDEs go too far. This time, instead of insulating you from how your code and/or language actually work, they shield you from the messy details of how your software gets built, how to invoke the debugger, how to navigate your version control system, etc.
To their credit, IDE writers really only have two options in this space: write their own compiler / debugger / version control system and embed it directly in the IDE (Microsoft / Borland), or attempt to integrate external tools as-is (Eclipse, mostly).
This is a lose-lose situation.
If you go with the "custom tools" solution, you are going to get an inferior compiler / debugger / version control system. The people writing the IDE have limited time and effort to focus on the task of writing the IDE itself, which usually means the tooling suffers. Turns out that compiler-writing is hard, and requires a special brand of hacker to pull off correctly. Debuggers are intimately intertwined with the compiler, since they have to understand the runtime, stack discipline, etc.
If you opt for the external integration, you also lose out. Sure, you're using GCC / GDB / git / what-have-you, each with their own teams of smart and dedicated people making them great, but the IDE necessarily limits how you get to use those tools. This may work out great for the beginner programmer, who just wants to focus on the code, but it hides the tools and thereby hinders advancement.
The point is, any tool you can embed into your IDE is a tool you'd be best served learning how to use on its own. Some really powerful stuff can be done with tools like gdb and git.
IDEs have some pretty insidious problems baked right-in.
I once saw a codebase that had defined a function for retrieving objects from a database. For sake of illustration, I'm going to call those objects Things. Here's the function header (in Go, body omitted for brevity):
func (db *Database) GetExstingThings() ([]*Thing, error) {
/* ... */
}
While doing the code review, I made a mental note that this function was probably cruft, and not used, because Existing was misspelled. Imagine my surprise when I found several (dozens or more) calls to this function, all with the same typo!
That's popup-coding if ever I've seen it. G-e-t-E-x-screw-it-that's-close-enough-⏎.
Sure, it's an innocent enough mistake, and one that search-and-replace (or better yet, those refactoring tools I keep hearing so much about) would make quick work of. But the problem goes deeper than that.
Not having to type the names of functions, or remember the order of parameters leads to bad design. If I rely exclusively on auto-completion logic, I have no incentive to come up with compact and concise function names; I'll just let the editor type the rest of ThatReallyLongFunction's name. This in turn leads to half-hearted attempts at finding good abstractions, since the point of analogy and abstraction is to reduce the cognitive load on the programmer through well-known patterns.
One complaint I often hear about libc, the C standard library, is that functions like memcpy are confusing with respect to parameter order. Here are the two possibilities:
void memcpy1(void *src, void *dst, size_t n)
void memcpy2(void *dst, void *src, size_t n)
(spoiler: it's the second one)
IDE supporters are quick to point out that code-completion makes this a moot point; the editor knows the type signature and the documentation for the function, so you don't have to.
There's a trick to memcpy that makes it easy to remember. The destination comes before the source argument, which mimics assignment in C:
char *a, *b; /* assume b points at a big-enough buffer */
a = strdup("Hello, World");
/* b = a */
memcpy(b, a, strlen(a));
You know what's really cool? Once you know and understand the "mimic assignment" convention, you can apply that to other functions with source / destination semantics, like memmove, strcpy, etc. On top of that, you can now spot lurking bugs in the code that auto-completion won't show you, like this:
char *a, *b;
a = strdup("Hello, World");
memcpy(a, b, strlen(a));
If you didn't know that the memcpy line translates to a = b, you'd miss the bug lurking in the dereference of the uninitialized b pointer.
IDEs promise the world and deliver something (I guess). That something isn't something you want as a professional. Masking complexity, hiding tools, encouraging bad design; all these things are bad.
If you want to become a better programmer, ditch the IDE and learn to use your language's toolchain.
]]>UPDATE: Geoff helpfully pointed out this morning that you can set the Warden netmask via the bare-metal.yml.example file. But that's no fun.
I recently spun up BOSH-lite on my lab server, thanks to the vagrant tweaks that Geoff Franks made against Ruben Koster's bare-metal-bosh-lite code.
With BOSH spinning, I put together a small testing BOSH release (which you can find here), and deployed it. The deployment worked fine, and soon I had some test VMs running as Warden containers on the server.
Problem is, the networking was all mismatched.
Anyone familiar with BOSH-lite is accustomed to the 10.244.0.0/16 network space that it uses by default. This convention over configuration approach has led to lots of BOSH releases shipping a Warden configuration out-of-the-box that spins up static IPs in the 10.244 network. No problem; BOSH-lite is intended to be used for development.
For reasons I'll not go into right now (maybe a future blog post), I want to be able to run several bare-metal BOSH servers using Warden containers. I don't want to invest the money in vSphere, and I don't have the patience to stand up Openstack. That leaves AWS, which is too expensive for my ambitions, and a bit overkill.
Which is why I turned to BOSH + vagrant in the first place.
The problem I ran into had to do with routing. Here's a highly simplified view of the network topology inside of my lab:
All wireless clients (laptops and phones included) live on the Unprivileged Access Network (the pink cloud), which routes everything to the Core Network (green) via a consumer-grade wireless router at 10.0.1.1.
The core network router is a Linksys WRT54G with a custom build of OpenWRT that allows it to manage routing, firewall duty, and termination of a handful of OpenVPN point-to-point VPNs that stitch my network into a larger one distributed across the Internet.
The important part of the diagram, however, is the 10.4.0.0/16 (yellow) network labeled "Service Network 1". This network exists entirely inside of the beefy server hardware that I am running BOSH-lite on. The Core Network OpenWRT router has static routes for the 10.4/16 subnet.
The easy solution, of course, is to just change the Service Network range from 10.4/16 to 10.244/16. Problem solved. Off to the deployments!
Unfortunately, that masks a not-so-subtle problem with the default configuration of BOSH-lite (which, as a development platform, is not important enough to warrant discussion): you can't run more than one!
Just for fun, let's try changing the BOSH deployment manifest to use networks in 10.4/16, and see if it works!
jumpbox $ ./test-dev manifest warden
jumpbox $ sed -i -e 's/10.244/10.4/' manifests/test-warden-manifest.yml
jumpbox $ bosh -n deploy
Acting as user 'admin' on deployment 'test-dev' on 'Bosh Lite Director'
Getting deployment properties from director...
Deploying
---------
Director task 59
Started unknown
Started unknown > Binding deployment. Done (00:00:00)
Started preparing deployment
Started preparing deployment > Binding releases. Done (00:00:00)
Started preparing deployment > Binding existing deployment. Done (00:00:00)
Started preparing deployment > Binding resource pools. Done (00:00:00)
Started preparing deployment > Binding stemcells. Done (00:00:00)
Started preparing deployment > Binding templates. Done (00:00:00)
Started preparing deployment > Binding properties. Done (00:00:00)
Started preparing deployment > Binding unallocated VMs. Done (00:00:00)
Started preparing deployment > Binding instance networks. Done (00:00:00)
Started preparing package compilation > Finding packages to compile. Done (00:00:00)
Started preparing dns > Binding DNS. Done (00:00:00)
Started creating bound missing vms > small_z1/0. Failed: Creating VM with agent ID \
'aacf0752-d306-4788-b6ee-aabbfc338d4b': Creating container: network already acquired: \
10.4.2.8/30 (00:00:01)
Error 100: Creating VM with agent ID 'aacf0752-d306-4788-b6ee-aabbfc338d4b': \
Creating container: network already acquired: 10.4.2.8/30
Task 59 error
For a more detailed error report, run: bosh task 59 --debug
You can try any number of different networks, but if they aren't in 10.244/16, BOSH (or more specifically, Warden) will fail to provision the network, claiming that it is "already acquired".
Luckily, while the decision to use 10.244/16 is hard-coded into the BOSH-lite/warden distribution, it is explicitly called out in exactly one place: the startup script that runs the Warden supervisor.
boshbox # grep -nC6 10.244.0.0 /var/vcap/jobs/warden/bin/warden_ctl
29- exec /var/vcap/packages/warden-linux/bin/warden-linux \
30- -disableQuotas=true \
31- -listenNetwork=tcp \
32- -listenAddr=0.0.0.0:7777 \
33- -denyNetworks= \
34- -allowNetworks= \
35: -networkPool=10.244.0.0/16 \
36- -depot=/var/vcap/data/warden/depot \
37- -rootfs=/var/vcap/packages/rootfs_lucid64 \
38- -overlays=/var/vcap/data/warden/overlays \
39- -bin=/var/vcap/packages/warden-linux/src/github.com/cloudfoundry-incubator/warden-linux/linux_backend/bin \
40- -containerGraceTime=5m \
41- 1>>$LOG_DIR/warden.stdout.log \
If you change line 35 to specify the -networkPool option as 10.0.0.0/8, and subsequently restart the Warden supervisor, you can provision against any subnet of 10/8, and even use different nets on different boxen.
boshbox # sed -i -e 's@10.244.0.0/16@10.0.0.0/8@' \
/var/vcap/jobs/warden/bin/warden_ctl
boshbox # monit restart warden
Happy Hacking!
]]>Over the weekend, something bad happened on the network.
(That's the opening line of the scariest techno-thriller on-call personnel can think of)
Anyway, these network goings-ons caused an interesting situation for bolo, the metrics gathering and monitoring system. One of the subnets was able to open a connection to the bolo core, but unable to properly close it (or even, it is believed, to send any useful data). About 30 machines in this network happily spent the rest of the weekend opening socket after socket, and then leaking them.
After less than a day, this handful of machines was able to run the bolo core out of its hard ulimit on open file descriptors, about 65k. This is where the real fun started.
Other clients, in correctly routed networks, became unable to nail up connections to the core, and started stacking TCP connections in the SYN_SENT state. When enough of these were created, the packet loss began.
The operational fix, for those of you who are interested in that sort of thing, was to firewall off the misbehaving network and restart the bolo process to start over on file descriptors.
But that's not what this post is about — it's about detection.
On any other machine, or for any other process, a simple monitoring check (like the process collector in bolo-collectors) would be sufficient. Let the monitoring agent peer into /proc, tally the open files, and relay that data up to the bolo core. But when the process that may exhaust open files is the bolo core, you run into the problem of not being able to submit the data. No graph. No alert. No visibility.
The logical course of action then, is to teach bolo how to count up its open file descriptors and track it internally, without needing a socket or pipe.
Obviously, exec-ing the process collector is out; in the worst case, we have no file descriptors for the pipe to the child. Looping over /proc/$$/fd is similarly infeasible; opendir(3) needs a descriptor.
Luckily, an unlikely combination of getrlimit(2) and poll(2) does the trick nicely.
Under POSIX-compliant operating systems (Linux, *BSD, OS X, etc.) all process resource usage is constrained by quotas or limits. For example, there is a limit to the number of pending signals a process can have, and another limit for stack size. But the most well-known of these is the number of open files limit, nofile.
To see your shell's nofile limit, run ulimit -n:
$ ulimit -n
1024
There are actually two limit values for each type, a soft limit and a hard limit. These are also referred to as the current limit and the maximum limit, because a process can elect to increase its soft (or current) limit up to, but no further than, its hard (or maximum) limit.
The ulimit shell builtin can show you the soft or hard limit:
$ ulimit -Sn
1024
$ ulimit -Hn
4096
Here, my shell can create up to 1024 files (the current / soft limit). If I want to, I can increase this limit up to 4096 files. 4097, as they say, is right out.
The getrlimit(2) system call lets us programmatically determine our limits from inside of C.
#include <stdio.h>
#include <sys/time.h>
#include <sys/resource.h>
int main(int argc, char **argv)
{
struct rlimit lim;
if (getrlimit(RLIMIT_NOFILE, &lim) != 0)
return 1; /* bail! */
printf("soft = %lu; hard = %lu\n",
lim.rlim_cur, lim.rlim_max);
return 0;
}
So now, bolo can definitely keep track of its limits. Bo-ring.
The poll(2) system call is almost exclusively used for I/O multiplexing. You've got a bunch of file descriptors, and you want to pick from the first available for reading and/or writing. The kernel knows all and sees all, so if you ask nicely (and allocate your pollfd structures properly) it will block until any of them are interesting enough to deal with.
So what does that have to do with tallying the open file descriptors?
As it turns out, if you tell poll(2) to watch a file descriptor that isn't open, you get back an output event of POLLNVAL, indicating that this descriptor is invalid, i.e. not a real descriptor.
Don't believe me? Try this out:
#include <stdio.h>
#include <unistd.h> /* for close(2) */
#include <poll.h>
int main(int argc, char **argv)
{
struct pollfd fds[1];
fds[0].fd = 2; /* standard error */
fds[0].events = 0; /* more on this later */
close(2);
if (poll(fds, 1, 0) < 0)
return 1; /* bail! */
if (fds[0].revents & POLLNVAL)
printf("fd 2 is not a real file descriptor\n");
else
printf("fd 2 *is* a real file descriptor (bug...)\n");
return 0;
}
There's a lot going on here, but we'll highlight the important bits.
The 3rd argument to poll(2) is 0. This is the timeout parameter, and it specifies the number of milliseconds the kernel will block before returning to the caller. If we were doing normal I/O multiplexing, this value would be -1 (block forever) or some useful value (like 2000ms). Passing 0 causes poll(2) to return immediately, even if none of the file descriptors are ready, which dovetails with the next point...
The pollfd.events attribute is also 0. The events attribute lets us specify whether we are interested in knowing when a file descriptor becomes readable, writable, has an error, etc. Setting it to 0 means "I don't care one wit about the condition of the file descriptor". Remember, we are only interested in that POLLNVAL error flag.
We close standard error (fd 2). The example doesn't work otherwise. The file descriptor has to be closed for poll(2) to flag us that it is invalid.
Even though we didn't specify any events, we still get revents. This is the crux of the whole hack: after poll(2) returns immediately, we can see that file descriptor 2 is no longer a valid file descriptor.
Now we are armed with the following critical techniques:

- finding the maximum number of file descriptors the process could possibly have open (via getrlimit(2))
- determining whether a single file descriptor is valid, without allocating any new ones (via poll(2))

What if we were to call poll(2) on all of the possible file descriptors? Since we know what the limit is, and file descriptors are just integers (per POSIX), we can do that:
#include <stdio.h>
#include <stdlib.h> /* for calloc(3) */
#include <sys/time.h>
#include <sys/resource.h>
#include <poll.h>
int main(int argc, char **argv)
{
struct rlimit lim;
if (getrlimit(RLIMIT_NOFILE, &lim) != 0)
return 1; /* bail! */
struct pollfd *fds = calloc(lim.rlim_cur, sizeof(struct pollfd));
if (!fds)
return 1; /* bail! */
int i;
for (i = 0; i < lim.rlim_cur; i++) {
fds[i].fd = i;
fds[i].events = 0;
}
if (poll(fds, lim.rlim_cur, 0) < 0)
return 1; /* bail! */
unsigned long nfds = lim.rlim_cur;
for (i = 0; i < lim.rlim_cur; i++)
if (fds[i].revents & POLLNVAL)
nfds--; /* not a real fd, discount it */
printf("%lu open files\n", nfds);
return 0;
}
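Saved as openfds.c (a name I just made up), it behaves as you'd hope, assuming only the standard streams are open:

$ cc -o openfds openfds.c
$ ./openfds
3 open files

The three survivors being, of course, stdin, stdout, and standard error.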
Finally, a single system call that can reliably (and quickly) determine how many open file descriptors the current process has. Let's do the breakdown, shall we?
We start with the nofile limit. This is the soft limit, because the hard limit has no bearing on the number of open file descriptors.
Then, allocate enough pollfd structures to represent all of the possible file descriptors. As before, we set the events attribute to 0, because we only care about the POLLNVAL error condition.
Counting is kind of backwards, but it works! Instead of starting at 0 and counting up, I chose to start at $max and decrement the count for every invalid file descriptor. You can do it either way.
No new file descriptors were allocated. In an earlier (and less successful) attempt, I employed epoll(2), thinking that it would be able to handle a larger number of file descriptors. Unfortunately, as the man page states:
epoll_create() returns a file descriptor referring to the new epoll instance.
So that won't work when we're completely out of file descriptors...
For my immediate needs (making bolo able to track its own file descriptor usage and alert upon exhaustion) I ended up writing a function called open_files that returns both resource limits alongside the number of open fds, via output parameters.
The code can be found here.
]]>Pendulum is the virtual machine at the heart of Clockwork 3.x, and provides the flexibility for both configuration management applications (i.e. Clockwork proper) and distributed remote execution and data gathering (Clockwork's exciting new Mesh framework).
For the record, if you want to use Clockwork (and who wouldn't?) you don't have to write any code; Clockwork helpfully translates your policy manifests into Pendulum assembly for you. And if you want to use Mesh, you still don't have to get your hands dirty with code.
No, this post is for the indomitable hacker who likes to get her hands into a thing, up to the elbows if necessary, and figure out how it works.
Let's just get this one out of the way.
fn main
print "Hello, World!\n"
fn and print are opcodes, short for operation code. Each opcode instructs the Pendulum VM to perform some computation, and can take up to two arguments. Here, fn is followed by a label ("main"), and defines the start of a function. Since Pendulum is styled after low-level assembly languages, it lacks block structure and can only support functions by associating function names with their code entry points.
The print opcode does just what you think it does; it prints its string argument to standard output.
If you've got a build of Clockwork handy, you can use the included pn utility to compile and run the example:
$ cat hello.pn
fn main
print "Hello, World!\n"
$ ./pn -S hello.pn
$ ./pn hello.pn.S
Hello, World!
Check the man page for pn(1) for more information.
The Pendulum VM is a register machine, not a stack machine. It consists of a set of 16 general-purpose registers, %a - %p, an instruction pointer, and an accumulator.
The set opcode stores a value into one of the general-purpose registers:
fn main
set %a "Hello, World!\n"
print %a
The print opcode has a trick up its sleeve. If you give it a format string, it can pull information out of the general-purpose registers, format them and print the result.
fn main
set %a "World"
print "Hello, %[a]s\n"
A format specifier is just the name of a register, inside of '%[...]', followed by a printf-format specifier. Really, the '[...]' is interposed between the '%' and the rest of the specifier. See printf(3) for details.
The basic arithmetic operators are add, sub, mult, and div. They each take two arguments, a register and a value. The register holds one of the operands (the leftmost one).
set %a 42
add %a 8 ;; %a == 50
sub %a 17 ;; %a == 33
mult %a 2 ;; %a == 66
div %a 3 ;; %a == 22
The mod opcode provides arithmetic modulo, the remainder after division. It also takes two arguments, a register and a value:
set %a 9
mod %a 8 ;; %a == 1
Pendulum has several comparison opcodes. eq, lt, lte, gt, and gte provide numeric comparison, while streq compares character strings for equality. Each of these opcodes takes two arguments, compares them to one another, and stores the result as an integer in the accumulator.
The acc opcode copies the value in the otherwise inaccessible accumulator into a named register:
eq 42 42
acc %a
print "42 == 42 ? acc = %[a]i\n"
Finally, we have the jump opcodes, jmp, jz, and jnz. These directly manipulate the instruction pointer register. All of these take relative offsets (like +1 or -3), which define a number of opcodes to skip over. You can also define labels and jump to them, by name.
It is illegal to jump across function boundaries. Luckily, labels with the same names, in different functions, are distinct.
jmp is an unconditional jump. It works like goto:
fn main
print "Pendulum is "
jmp +1
print "not " ; never executed
print "very cool\n"
will print Pendulum is very cool, followed by a newline.
jz (jump if zero) checks the accumulator and changes the instruction pointer if the accumulator is 0. On the contrary, jnz (jump if not zero) only jumps when the accumulator is any other value. If the conditions are not met, execution continues with the next op.
Obviously, this lets us implement if / else conditionals:
;;
;; is %a an even number?
;;
fn even?
mod %a 2
eq 0
jz +2
print "%[a]i is even\n" ;; if (a % 2 == 0)
retv 0
print "%[a]i is odd\n" ;; else
retv 1
They also let us implement loops:
;;
;; countdown from %a to 0
;;
fn countdown
again:
eq %a 0
jz boom
print "%[a]i...\n"
sub %a 1
goto again
boom:
print "BOOM!\n"
Pendulum isn't all that picky about whitespace between statements. Sometimes you'll see a conditional and a jump on the same line. Usually (although not always) the jump is a jz:
fn test
eq %a 42 jz +1
ret
print "a is not 42\n"
That's idiomatic Pendulum for if a == 42.
We've now got enough Pendulum Assembly under our belts to write the next most obligatory learning-a-new-language example: calculating Fibonacci numbers!
To recap, for the mathematically disinclined, the Fibonacci sequence is:
$$(1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, …)$$
Each Fibonacci number is the sum of the preceding two Fibonacci numbers, or,
$$F_n = F_{n-1} + F_{n-2}$$
;;
;; Calculate the nth Fibonacci number
;;
fn fibonacci
gte %n 2 jz +1 ;; F(0) = 1
retv 1 ;; F(1) = 1
sub %n 1 ;; first we calculate n - 1
call fibonacci ;; then F(n - 1)
acc %o ;; storing the value in %o
sub %n 1 ;; then, calculate n - 2
call fibonacci ;; and F(n - 2)
acc %p ;; storing the value in %p
add %o %p ;; add the two values...
retv %o ;; and return the sum
And here's a looping main function that calculates the first 14 Fibonacci numbers (any more and we run into stack problems because of the recursion):
fn main
set %n 0
again:
gte %n 14 ;; loop termination; after
ret ;; 14 numbers, we're done.
call fibonacci ;; calculate F(%n)
acc %a ;; store it in %a, temporarily
print "F(%[n]i) = %[a]i\n" ;; and then print
add %n 1 ;; increment %n and
jmp again ;; do it again
And here's the output!
F(0) = 1
F(1) = 1
F(2) = 2
F(3) = 3
F(4) = 5
F(5) = 8
F(6) = 13
F(7) = 21
F(8) = 34
F(9) = 55
F(10) = 89
F(11) = 144
F(12) = 233
F(13) = 377
Looks correct!
To get around the recursion stack overflow problem (try to calculate F(16) and you'll see what I mean), you could solve the recurrence for a closed-form expression for the Fibonacci numbers. If that type of stuff interests you, I highly recommend Knuth's Concrete Mathematics textbook (most likely available at your local library).
Pendulum is a complete and simple assembly language. The opcodes are straightforward and while the official documentation is still mostly non-existent, you can look through the opcodes.yml file, and src/vm.c on github for more information.
Hopefully in the coming weeks I'll find time to cover the fs.* opcodes, the authentication database and user/group management.
Here's the code from this article, available for download:
]]>Ringing in the New Year, I joined the CPAN Pull Request Challenge, a loosely organized (ragtag) group of F/OSS enthusiasts of varying levels of Perl expertise who are trying to Make CPAN Better™, one pull request at a time.
The premise is simple enough: every month, neilb (the guy in charge) fires up a script that works through CPAN modules / dists and semi-randomly assigns each participant a module for the month. Participants then have the rest of the month to find one or more ways to make their module better. Work is done, and a Github pull request is submitted to get the changes merged in by the current maintainer.
I got Crypt::OpenSSL::X509, a Perl XS module that glues together libopenssl's certificate handling (the purview of openssl x509 ...) and presents a familiar, more Perl-y way of parsing certificates.
I believe (but cannot directly substantiate) that the module made it into the candidate list for the challenge because of the following:
When I got the assignment email, I was pretty happy; I love writing C code, XS code is (in my experience) less well-maintained than straight-up Perl, and I had always wanted to tinker with perlguts.
Before I got started writing some code however, I wanted to figure out what needed to be done, and coordinate with the maintainer to see what they would like to see done. The timeline of events looks something like this:
Let's talk about the Github issues for a moment.
The most promising, #28, was an ask for the -nameopt semantics of the openssl binary to be ported to the Perl module. The requester had worked around the issue by shelling out via backticks to just run the command and get the output.
The others were not well-suited to something like the Pull Request Challenge; two of them were reporting problems compiling the XS module on Solaris/SPARC platforms. Others dealt with UTF-8 issues that were causing the tests to fail (but not on any platform I have access to). I'm sure these types of very platform-specific issues are confined to XS modules, and may in fact just come with the territory.
This is the email I sent Dan (the maintainer):
My name is James Hunt, and I'm participating in the 2015 CPAN Pull
Request challenge (http://neilb.org/2014/11/29/pr-challenge-2015.html).
I got Crypt::OpenSSL::X509 as my module/project, and I wanted to
reach out to (A) introduce myself and (B) start a conversation about
what needs to be done (if anything) to make the module better.
As I said, I'm James, and I hack Perl for a living. I'm jhunt on
github and JRHUNT on PAUSE. My website is https://jameshunt.us
There, (A) complete!
As for (B), I took a look at the issues on GH -- most of them seem to be
about platforms I don't have access to (OSX / Solaris). The nameopt
feature request seemed interesting, but I wanted to get your feedback on
the direction you want to take with the module before I start writing
any code.
Thoughts?
-jrh
Less than a week later, Dan replied and suggested that I go forward with the nameopt thing, and urged me to contact the original requester to get more information.
After a dive into the OpenSSL codebase (never as fun as it sounds), a few conversations with the original requester on Github, and two false starts on a pull request, #28 got closed without a single line of code being written.
A little more exploration of what Crypt::OpenSSL::X509 already provides led me to the entries() function of X.509 name objects (both Issuer and Subject names). In the end, everything the requester needed was already in the module! I commented on the issue, with some working Perl code that did what the requester had wanted to do in the beginning, and emailed Dan that he could close the issue.
Issue closed! No work necessary!
No Pull Request for James…
I still feel like I helped out. Looking into the more platform-specific issues taught me a lot about how CPAN smoke testing works, and I managed to get two more issues closed. The dist now only has 3 open issues on Github, down from 6 — that's a 50% reduction in bugs!
Yet the challenge still stands.
I may have to settle for an official 'loss' on this go-round, and hope for a less well-maintained piece of software the next time around. In the meantime, I'll be reading this and this, and hanging out in #pr-challenge on irc.perl.org looking for inspiration.
]]>I love round-robin databases. They neatly solve the problem of efficiently storing time-series metrics. When you provision them, they allocate all the storage they will ever need, so you don't have to worry about monitoring servers filling up at oh-dark-thirty. Each RRD is a self-contained file, making maintenance a breeze.
The only downside is that they don't scale well. The more metrics you collect, the more RRDs you need to store the data in. More RRDs mean more files and more updates. The increasing load on the I/O subsystem can be mitigated via rrdcached, which buffers updates in memory so that more data is written in each write op.
That works, to a point, but eventually you will hit the wall. Eventually, there won't be enough I/O throughput. Eventually, there won't be enough disk. Eventually, the server holding your round-robin databases will die.
We could mirror our data submission, sending RRD updates to multiple servers in concert. This is similar to mirroring disks in a RAID-1 array, or multicasting requests to identical servers in a server farm. Unfortunately for us, this only addresses the failure scenario, and doesn't help us track more data.
When one thing exhibits failure tendencies, we throw more things at the problem. Disks fail, so we put lots of them in a RAID array ensuring that failure of a single drive is non-fatal. Servers fail, so we put lots of them in a load-balanced pool to guard against a complete outage.
We can do the same with RRD, as long as we keep certain things in mind:

1. RRD updates are partial; each one modifies an existing database rather than writing a self-contained record.
2. Every update for a given RRD must be routed to the one node responsible for that RRD.
3. Each RRD must be replicated to more than one node, to survive server failure.
4. The set of nodes will change over time, and routing has to cope with that gracefully.
Point #1 is important, if somewhat obvious. Existing distributed data systems like Cassandra, Redis, and MongoDB fail us here because they do not allow partial updates. Primarily, this is because partial updates require a priori knowledge of the data being stored, and these systems are commodity infrastructure.
The second and fourth points are related. When a client submits an update for RRD $X$, we need to ensure that we always give that update to the node responsible for that RRD. Routing it anywhere else serves little purpose; the other nodes don't know anything about the historical data that the update applies to. Worse, if the node that does handle the RRD misses the update, you have data loss.
We can honor point #2 by using a hashing strategy to translate an RRD filename into a responsible node. A naïve design may rely on modulo operations to turn any hashed value into an index into a list of nodes.
Mathematically, it looks something like this:
$$L(k) = S_{H(k) \bmod n}$$
$L(k)$ is the locator function, which translates an RRD filename key, $k$, into a node address, in the range $S_0…S_{n-1}$ (assuming that we have $n$ nodes). $H(k)$ is a hash function that maps an arbitrarily long string of characters (our filename) into a number.
We'll come back to why this is wrong later.
Point #3 is vital to the viability of the distributed storage system because it is the only way we can protect against server failure. If we distributed our databases across 12 servers, each with a total failure rate of once every 12 months, we will see, on average, an outage every single month. Without replicating the RRDs to other servers, we will suffer data loss, and it will happen faster than you would intuitively expect.
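The arithmetic behind that claim, assuming failures are independent and spread evenly:

$$12\ \text{servers} \times \frac{1\ \text{failure}}{12\ \text{server-months}} = 1\ \text{failure per month}$$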
Let's go back to hashing for a minute.
$$L(k) = S_{H(k) \bmod n}$$
Earlier, I pointed out that this was wrong. To understand why, let's look at some real hash values and the distribution across our nodes.
Here is Dan Bernstein's djb2 hash function:
unsigned long
hash(unsigned char *str)
{
    unsigned long hash = 5381;
    int c;

    while ((c = *str++))
        hash = ((hash << 5) + hash) + c; /* hash * 33 + c */

    return hash;
}
Let's start out by taking a set of 7 RRD filenames, and hashing them with djb2 to see how they spread. For brevity's sake, I'm going to drop all but the least-significant 8 bits, to get a number between 0 and 255 (inclusive).
"host01.example.com:cpu" = 77
"host01.example.com:memory" = 62
"host01.example.com:load" = 229
"host02.example.com:cpu" = 78
"host02.example.com:memory" = 159
"host02.example.com:load" = 6
"www.example.com:requests_per_second" = 168
In other words, $H(\text{host01.example.com:cpu}) = 77$. Now, assuming we have $n = 3$ nodes (and that $S$ ranges from $S_0$ to $S_2$), we can calculate the values for $L(k)$:
$$L(\text{host01.example.com:cpu}) = S_{H(\text{host01.example.com:cpu}) \bmod 3}$$
$$= S_{77 \bmod 3}$$
$$= S_2$$
Likewise, we can calculate the location of all the other RRD filename keys:
L(host01.example.com:cpu) = 77 mod 3 = 2
L(host01.example.com:memory) = 62 mod 3 = 2
L(host01.example.com:load) = 229 mod 3 = 1
L(host02.example.com:cpu) = 78 mod 3 = 0
L(host02.example.com:memory) = 159 mod 3 = 0
L(host02.example.com:load) = 6 mod 3 = 0
L(www.example.com:requests_per_second) = 168 mod 3 = 0
Seems to be working so far. Armed with only knowledge of the locator function $L(k)$, clients can reliably and repeatably figure out what node they should direct their updates at.
But all is not well! Don't forget about point #4! What happens if we add a new node into the mix? Assuming we can inform all clients about the new topology, how will it affect hashing? The introduction of a new member of $S$ changes the range of $S$, and the value of $n$, which you'll recall was used as part of the definition of $L(k)$. Now, $n = 4$.
Let's revisit host01.example.com:cpu:
$$L(\text{host01.example.com:cpu}) = S_{H(\text{host01.example.com:cpu}) \bmod \mathbf{4}}$$
$$= S_{77 \bmod \mathbf{4}}$$
$$= S_{\mathbf{1}}$$
The hash function $H(k)$ didn't change, so our hashed value is still 77, but the change of $n$ from 3 to 4 perturbed the results of the modulo operation, causing host01.example.com:cpu to move from $S_2$ to $S_1$. Naturally, we expect some part of the keyspace to be redistributed, otherwise adding a new node would have zero effect. So just how many keys migrated?
L(host01.example.com:cpu) = 77 mod 4 = 1
L(host01.example.com:memory) = 62 mod 4 = 2 # same
L(host01.example.com:load) = 229 mod 4 = 1 # same
L(host02.example.com:cpu) = 78 mod 4 = 2
L(host02.example.com:memory) = 159 mod 4 = 3
L(host02.example.com:load) = 6 mod 4 = 2
L(www.example.com:requests_per_second) = 168 mod 4 = 0 # same
Turns out, a little over half of all keys migrated to new hosts. Intuitively, we would have expected something close to a fourth, since the new node (given uniform distribution of keyspace) should be responsible for only one quarter of the entire keyspace. Looking at the recalculated results above, the new node ($S_3$) is in fact only responsible for one of our seven example keys.
In real life, this mass-migration would cause many I/O-intensive and network-heavy transfers of RRD files. Ideally, we would like to minimize this as much as possible.
Mathematically speaking, the problem with the modulo-hash strategy is that the resulting value depends highly on $n$, which we expect to be able to change up or down as we see fit. What we need is a hashing strategy that doesn't depend on the number of nodes in $S$, and that's what consistent hashing gives us.
The scholarly literature on consistent hashing, primarily David Karger's 1997 ACM paper and Daniel Lewin's 1998 thesis, develops the idea of a consistent hashing function in the context of scalable web cache architecture. This allows them to sidestep certain issues that are unavoidable in authoritative data storage systems, like the one we're trying to build.
Nevertheless, the theory is sound, so let's jump right in!
It starts with a circle.
If we define a new function, $r(h)$, which takes an arbitrary number and maps it to a point on the circle, we can feed it the result of our hash function, like so:
$$r(H(k)) = (x_k, y_k)$$
(For now, don't worry about how we actually implement $r(h)$. Also, because djb2 doesn't avalanche well for short keys, we'll be switching to a modified SHA1, where we take the first 8 bits of the hash result.)
Assume that we calculate $r(H(k))$ for all of our RRD filename keys, and we get this:
Next, we assign our storage nodes from $S$ to points $A$, $B$ and $C$ on the ring:
To figure out which node $S_x$ is responsible for a key $k$, we find the point $r(H(k))$ on the ring, and then visit each subsequent point, in a clockwise direction, until we find one that corresponds to a node.
For example, let's assume that we hash host02.example.com:memory, run it through $r(h)$, and it spits out the point labeled $4$ in the diagram. Starting there, we walk the ring clockwise, passing point $7$, until we reach $B$.
Effectively, this means that each node point "owns" all key points that precede it. The exciting (and useful!) quality of this is that adding other nodes will not affect the placement of the existing nodes, since each node's point on the ring is determined solely by the hash function $H(k)$ and information intrinsic to the node itself. In our case, we are hashing the node's fully-qualified, canonical hostname.
Looking again at the diagram, we come up with the following ownership table:
Node | Key points owned
---|---
A | 3
B | 2, 4, 5, 7
C | 1, 6
Visually, we can see that $B$ accounts for more than 50% of the ring; our ownership table bears this out. We can correct this disparity by adding another node, and hope that it bisects the arc between $A$ and $B$ (preferably between key points $5$ and $4$).
Just for fun, let's go ahead and add that fourth $D$ node, to demonstrate how resilient consistent hashing is:
And here's the updated ownership table:
Node | Key points owned
---|---
A | 3
B | 4, 7
C | 1, 6
D | 2, 5
Adding new nodes just to balance out statistical clustering is a pretty raw deal. Luckily, there's another way. The more node points we add, the smoother the distribution will be across nodes. What if, instead of adding real, physical nodes, we just assign multiple points to each node? We'll call them virtual nodes:
A good scheme (assuming that we chose an $H(k)$ with good avalanche properties) is to prefix the node name with a number from $0$ to $v - 1$, where $v$ is the number of virtual nodes to create per physical node. Whatever differentiating algorithm you choose, just make sure the clients all know (and agree on) it!
Note that each node now exists as four separate points on the ring. The same ownership behaviors apply (walk the ring clockwise until you hit a node point), so we can regenerate our ownership table:
Node | Key points owned
---|---
A | 4, 7
B | 1, 2
C | 3, 5, 6
That's a much more even distribution of the keyspace, and it gets more even as (1) the hash function gets better and (2) the number of virtual nodes goes up.
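To make all of this concrete, here's a minimal sketch of the ring in C, reusing the 8-bit djb2 truncation from earlier and the number-prefix virtual node scheme described above. The names (ring_add, locate, VNODES) and the fixed-size point list are mine, for illustration only; a real system would layer replication and data migration on top.

```c
#include <stdio.h>

#define VNODES 4    /* virtual node points per physical node */
#define MAX    64   /* total points the ring can hold        */

struct point {
    unsigned char where;  /* position on the ring, 0-255 */
    const char   *node;   /* physical node that owns it  */
};

static struct point ring[MAX];
static int npoints = 0;

/* djb2, truncated to its least-significant 8 bits, as above */
static unsigned char h8(const char *s)
{
    unsigned long hash = 5381;
    int c;
    while ((c = *s++))
        hash = ((hash << 5) + hash) + c;
    return hash & 0xff;
}

/* add VNODES points for one physical node, hashing "0:name", "1:name", ... */
static void ring_add(const char *node)
{
    char vname[256];
    int i;

    for (i = 0; i < VNODES && npoints < MAX; i++) {
        snprintf(vname, sizeof(vname), "%d:%s", i, node);
        ring[npoints].where = h8(vname);
        ring[npoints].node  = node;
        npoints++;
    }
}

/* walk clockwise from the key's point to the nearest node point */
static const char *locate(const char *key)
{
    unsigned char k = h8(key);
    unsigned int bestd = 256;
    int i, best = -1;

    for (i = 0; i < npoints; i++) {
        /* clockwise distance from key to node point, wrapping mod 256 */
        unsigned int d = (unsigned char)(ring[i].where - k);
        if (d < bestd) {
            bestd = d;
            best  = i;
        }
    }
    return best < 0 ? NULL : ring[best].node;
}

int main(void)
{
    ring_add("node-a.example.com");
    ring_add("node-b.example.com");
    ring_add("node-c.example.com");

    printf("host01.example.com:cpu -> %s\n",
           locate("host01.example.com:cpu"));
    return 0;
}
```

Clients only need the node list and the hash function to agree on placement; adding a node only perturbs the arcs immediately counter-clockwise of its new points.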
We can actually use this virtual node trick to weight some nodes differently than others. Let's suppose that node $A$ has twice as much capacity (both I/O and raw disk storage) as the other two nodes. If we assign twice as many virtual nodes (8 instead of 4) to $A$, we should see an increase in the amount of keyspace that $A$ is responsible for:
And here's the new ownership table:
Node | Key points owned
---|---
A | 1, 3, 4, 7
B | 2
C | 5, 6
As expected, $A$ now holds twice as much keyspace as either of the other two nodes!
]]>I work in monitoring, and one of the things that I have to deal with is process metrics. People care (or at least should care) about all kinds of things related to their processes. Is it running? Is there only one? How many open files does each have? How many threads are in the parent process?
How much memory is the process (and its children) using?
That last one is one of the trickiest, owing in no small part to the sophistication of modern memory management systems.
But top
shows that, right? Kinda.
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
22522 jrhunt 20 0 2305908 100820 31824 S 4.6 1.7 0:26.07 banshee
2760 root 30 10 532132 58300 1584 S 3.3 1.0 96:25.44 vdo-partial-upgr
2889 jrhunt 20 0 1809616 322004 24348 S 3.3 5.4 231:43.15 compiz
25642 jrhunt 20 0 2171372 668276 67440 S 1.7 11.1 153:55.94 firefox
1392 root 20 0 435612 96008 61960 S 1.3 1.6 104:41.43 Xorg
2852 jrhunt 20 0 437844 6448 3612 S 1.0 0.1 19:55.79 pulseaudio
2989 jrhunt 20 0 408736 11316 7912 S 1.0 0.2 72:32.86 indicator-multi
427 root -51 0 0 0 0 S 0.7 0.0 16:55.50 irq/44-iwlwifi
2789 jrhunt 20 0 669608 52004 12692 S 0.7 0.9 46:39.71 unity-panel-ser
907 root 20 0 0 0 0 S 0.3 0.0 7:35.90 rts5139-polling
Easy as pi. Here, banshee
is using 2.3G of virtual memory and 100M of RAM, right?
Hardly.
Unless banshee is statically compiled, it's sharing memory with other processes that are using the same dynamic libraries. To see exactly what ranges of the process address space are mapped to what object files / memory-mapped files, look at /proc/$$/maps:
... snip ...
7fd65ccf1000-7fd65ccf2000 rw-p 00105000 08:02 1845997 /lib/x86_64-linux-gnu/libm-2.19.so
7fd65ccf2000-7fd65cd15000 r-xp 00000000 08:02 1845995 /lib/x86_64-linux-gnu/ld-2.19.so
7fd65cd15000-7fd65cd16000 r--s 00000000 08:02 5359968 /var/cache/fontconfig/1ac9eb803944fde146138c791f5cc56a-le64.cache-4
7fd65cd16000-7fd65cd1a000 r--s 00000000 08:02 5251626 /var/cache/fontconfig/4d6aee6d44eccb37054d3216e945f618-le64.cache-4
7fd65cd1a000-7fd65cd29000 r--p 00000000 08:02 1576851 /usr/lib/mono/gac/Mono.Cairo/4.0.0.0__0738eb9f132ed756/Mono.Cairo.dll
7fd65cd29000-7fd65cd5c000 r--p 00000000 08:02 3545527 /usr/lib/banshee/Hyena.dll
7fd65cd5c000-7fd65cd7a000 r--p 00000000 08:02 832305 /usr/lib/mono/gac/dbus-sharp/1.0.0.0__5675b0c3093115b5/dbus-sharp.dll
7fd65cd7a000-7fd65cd90000 r--p 00000000 08:02 832327 /usr/lib/mono/gac/glib-sharp/2.12.0.0__35e10195dab3c99f/glib-sharp.dll
7fd65cd90000-7fd65cda0000 rw-p 00000000 00:00 0
7fd65cda0000-7fd65cda2000 r--s 00000000 08:02 5358353 /var/cache/fontconfig/767a8244fc0220cfb567a839d0392e0b-le64.cache-4
7fd65cda2000-7fd65cda7000 r--s 00000000 08:02 5353921 /var/cache/fontconfig/7ef2298fde41cc6eeb7af42e48b7d293-le64.cache-4
7fd65cda7000-7fd65cdb0000 r--p 00000000 08:02 3545605 /usr/lib/banshee/Extensions/Banshee.Fixup.dll
7fd65cdb0000-7fd65cdc0000 rw-p 00000000 00:00 0
... snip ...
Since the ld-2.19.so
library is mapped in r-xp
mode — i.e. not writable — it's a safe bet that anything else mapping that library is sharing pages with us. Even libm-2.19.so
, which is writable, is probably also sharing pages, since the p means private with copy-on-write semantics.
This is noteworthy. If the process does write to the libm's memory region, the kernel will craftily allocate a new page, copy the original data to it, and present that to the process. This new page will belong exclusively to us, and we should count that page towards our overall process memory footprint, while still discounting the untouched pages that are still shared.
Enter /proc/$$/smaps.
A coworker showed me this file, and its exhaustive accounting of memory mappings makes the maps file look like a tweet from the kernel (#proc #systemstats #YOLO):
7fd65ccf1000-7fd65ccf2000 rw-p 00105000 08:02 1845997 /lib/x86_64-linux-gnu/libm-2.19.so
Size: 4 kB
Rss: 4 kB
Pss: 4 kB
Shared_Clean: 0 kB
Shared_Dirty: 0 kB
Private_Clean: 0 kB
Private_Dirty: 4 kB
Referenced: 4 kB
Anonymous: 4 kB
AnonHugePages: 0 kB
Swap: 0 kB
KernelPageSize: 4 kB
MMUPageSize: 4 kB
Locked: 0 kB
VmFlags: rd wr mr mw me ac sd
7fd65ccf2000-7fd65cd15000 r-xp 00000000 08:02 1845995 /lib/x86_64-linux-gnu/ld-2.19.so
Size: 140 kB
Rss: 124 kB
Pss: 1 kB
Shared_Clean: 124 kB
Shared_Dirty: 0 kB
Private_Clean: 0 kB
Private_Dirty: 0 kB
Referenced: 124 kB
Anonymous: 0 kB
AnonHugePages: 0 kB
Swap: 0 kB
KernelPageSize: 4 kB
MMUPageSize: 4 kB
Locked: 0 kB
VmFlags: rd ex mr mw me dw sd
Yeah, that's way more information than we're used to from the old maps file. But what does it all mean? You can get some of the basics from proc(5), but my local copy didn't even mention the Pss
column.
Why not experiment?
(I'm glad you asked. What follows is a safari into the uncharted jungle of memory management, aided by our trusty friend Exploratory Programming in C!)
I love exploratory programming. Home directories on hundreds of servers I've managed over the years are littered with files named t.pl or x.sh. What better way to verify a language or OS feature than to just knock out a quick script and see what happens?
Naturally, when confronted with the compelling but coy smaps file, my first thought was to write a small test script. But Perl and Bash just would not do — we need something close to the machine. Ah yes, C. For the level of control over memory, and its type (heap-allocated, static, stack-allocated, etc.), C is the easiest and best choice.
For those of you who (like me) have to get your hands into things, I've posted all of the code, along with a Makefile, to github.
To make this easier, I've written a helper script and some convenience C functions; these are intended to make it easier to see both what is going on with the memory mappings, and to distill the test cases down to their base essentials:

- diag, a script that summarizes the memory accounting in /proc/$$/smaps.
- dirty(size), which allocates size bytes of data on the heap and writes to all of them.
- clean(size), which allocates size bytes on the heap but never touches them.
- randomize(buf, len), which scribbles random data over the first len bytes of buf.
- interlude(), which pauses the program so we can go inspect its /proc files.
.Everything has overhead, even close-to-bare-metal C programs. Before we start experimenting, let's establish a baseline with the null
program:
#include "lib.c"
int main(int argc, char **argv)
{
return interlude();
}
Why yes, I am including another C source file (lib.c) — that's not a typo. Doing it this way, instead of as another translation unit, makes the build process simpler; just issue a make null
to build null.c!
Here's a sample run:
$ ./null
pid 11091
------------------------------------------
go check /proc/11091/smaps; I'll wait...
press Enter when you're done
With that running in one console, here's what the diag script gives us:
$ ./diag 11091
[mmap]:
private 4.0 k [clean] 48.0 k [dirty]
shared - [clean] - [dirty]
[stack]:
private - [clean] 16.0 k [dirty]
shared - [clean] - [dirty]
Without actually doing anything, we've allocated 16 kilobytes of stack and 52 kB of mmapped memory, 4 kB clean and the remaining 48 kB dirty. That memory is probably brought in by libc (the only library we've linked with).
For our first experiment, let's see what happens when we allocate memory statically:
#include "lib.c"
static char buf[32 MB] = {0};
int main(int argc, char **argv)
{
randomize(buf, 32 MB);
return interlude();
}
(check lib.c for the definition of the MB macro)
The 32 MB buffer is static, which means that 32 megabytes of buffer space will be reserved in the .data segment of the ELF image. This should show up against the [mmap] section, because the binary image is memory-mapped into the address space of the running process.
[mmap]:
private 4.0 k [clean] 32.0 M [dirty]
shared - [clean] - [dirty]
[stack]:
private - [clean] 16.0 k [dirty]
shared - [clean] - [dirty]
Sure enough, there's 32M of dirty private memory. If you recall our null test earlier, the stack still has 16 kB allocated, and we see our 4 kB of private clean mmap'd memory.
Static allocation is boring. What happens if we allocate on the stack?
#include "lib.c"
int main (int argc, char **argv)
{
char buf[28 KB] = {0};
randomize(buf, 28 KB);
return interlude();
}
(again, look at lib.c for the KB macro definition)
Simple enough; allocate 28 kilobytes on the activation record for main(). Starting from our null baseline of 16 kB stack, we should expect the stack to increase by 28 kB to 44 kB, all of it private and marked as dirty.
Here's what ./diag
has to say:
[stack]:
private - [clean] 44.0 k [dirty]
shared - [clean] - [dirty]
The heap, or free store, is an area filled with blocks of memory (often of different sizes) where calls like malloc, calloc, realloc, and friends play. Let's see if we can do some heap allocations and see what smaps does.
#include "lib.c"
int main(int argc, char **argv)
{
dirty(16 MB);
clean(32 MB);
return interlude();
}
Here we've allocated a total of 48 MB on the heap; 16 MB will be changed, and we'll leave 32 MB untouched. As we'll see, however, that doesn't matter for heap accounting - when you get it, it's in RAM whether or not you use it.
[heap]:
private - [clean] 48.8 M [dirty]
shared - [clean] - [dirty]
See?
If you look closely, you'll see we actually have 48.8 MB in our heap mapping. This is overhead of our malloc implementation. At some point, the program break had to be moved to accommodate our memory requests, via sbrk(2). Rather than request the exact amount required, the glibc malloc implementation requests a bit extra, to handle future requests.
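You can watch this happen. Here's a tiny experiment (separate from the lib.c harness) that checks the program break before and after a 64 kB malloc(), which is small enough to stay on the sbrk path; the exact padding you see will vary with your glibc version:

```c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Ask for 64 kB (safely under MMAP_THRESHOLD, so malloc uses sbrk)
 * and see how much the program break actually moved. */
int main(void)
{
    void *before = sbrk(0);
    char *p = malloc(64 * 1024);
    void *after = sbrk(0);

    printf("requested %d bytes; break moved %ld bytes\n",
           64 * 1024, (long)((char *)after - (char *)before));

    free(p);
    return 0;
}
```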
When a process forks, it becomes two processes, executing the same code, with the same memory segments. Older UNIX implementations naively copied every single page of memory from the parent to the child, and they resumed execution with completely different memory.
Turns out, the normal use case for fork(2) is to follow it immediately with an execve(2) (or a variant) in the child process, effectively blowing away all that carefully copied memory and starting over with a new program image.
BSD initially solved this by punting the optimization back to the programmer; they introduced the vfork(2)
system call that skipped all that memory copying. Linux (and eventually all the other modern UNIXes, including BSD) opted for a thing called copy-on-write, or CoW.
Copy-on-write semantics work like this: child address space is mapped to the same backing pages (RAM) as the parent, except that when the child attempts to write to one of those pages, the kernel transparently copies the memory contents to a new, dedicated page, before carrying out the write.
This speeds up fork+exec considerably; no more time wasted copying all that memory. It also speeds up fork in other scenarios, especially when the child ignores most of its parent's memory space.
The downside: it really complicates memory usage analysis.
Let's say you want to know how much memory your Apache web server is using, including the master listener process and all of the child worker processes. (if you don't like Apache, substitute your favorite multi-process system).
Clearly, you need to know how much heap / stack / mmap'd memory the parent process is using, and we now know how to account for that. Looking at the child processes brings up a different set of problems, and brings us to the distinction of private memory vs. shared memory.
But first, an example:
#include "lib.c"
int main(int argc, char **argv)
{
char *buf = malloc(64 KB);
randomize(buf, 64 KB);
interlude();
pid_t pid = fork();
assert(pid >= 0);
if (pid == 0) {
randomize(buf, 16 KB);
return interlude();
}
int st;
waitpid(pid, &st, 0);
return st;
}
The parent allocates 64 kB on the heap, and then forks a child process which will update 16 kB of that buffer. This should trigger our copy-on-write semantics nicely.
You'll notice we call interlude()
twice; once right after the parent allocations, but before the fork, and again at the end. This lets us inspect the state of the parent before the fork:
[heap]:
private - [clean] 68.0 k [dirty]
shared - [clean] - [dirty]
(I've left off the [stack] section, as it has no bearing on CoW semantics)
As expected, there's 64 kB of heap usage (plus 4 kB for malloc overhead). Pressing Enter at the interlude prompt will continue on with the fork and subsequent child shenanigans, and then we can inspect the parent process again:
[heap]:
private - [clean] 20.0 k [dirty]
shared - [clean] 48.0 k [dirty]
After the fork, the parent is now sharing 48 kB with the child process, which corresponds to the parts of buf that have not yet been overwritten by the child process. 20 kB of heap (really, 4 kB of overhead and 16 kB of buf) is now marked as private.
Let's look at the child:
[heap]:
private - [clean] 20.0 k [dirty]
shared - [clean] 48.0 k [dirty]
It looks identical, because it's sharing the same amount of memory with the parent (those last 48 kB of buf), and has its own copy of heap overhead (4 kB) and the 16 kB that it overwrote.
So far we've looked at static allocation, stack allocation and heap allocation. Now, let's turn to the fourth allocation discipline: mmap(2).
The only system call in our line-up, mmap(2) creates a new mapping in the process address space. It is directly responsible for the creation of a new section in our smaps file. These mappings can be file-backed or anonymous. They can be marked readable, writeable, executable or a combination thereof. They can be reserved for private use, or shared between processes.
Here's mmap.c, an experiment to see what happens with each type of mapping:
#include "lib.c"
int main(int argc, char **argv)
{
/* inert map (never modified) */
char *inert = mmap(NULL, 16 KB,
PROT_READ|PROT_WRITE,
MAP_ANONYMOUS|MAP_PRIVATE,
-1, 0);
/* anonymous, private mmap */
char *anon_priv = mmap(NULL, 32 KB,
PROT_READ|PROT_WRITE,
MAP_ANONYMOUS|MAP_PRIVATE,
-1, 0);
randomize(anon_priv, 32 KB);
/* anonymous, shared map */
char *anon_shared = mmap(NULL, 64 KB,
PROT_READ|PROT_WRITE,
MAP_ANONYMOUS|MAP_SHARED,
-1, 0);
randomize(anon_shared, 64 KB);
/* private, file-backed map */
int fd = open("data/256k", O_RDWR);
assert(fd >= 0);
char *file = mmap(NULL, 256 KB,
PROT_READ|PROT_WRITE,
MAP_PRIVATE,
fd, 0);
randomize(file, 128 KB);
return interlude();
}
The code is long, because of the line-wraps, but it's not too complicated.
First, we create a new anonymous private map of 16 kB, which we never use. Anonymous maps can only be backed by physical RAM (or swap space), because they have no file backing. The Linux kernel will set up the address range, but doesn't actually allocate any physical RAM, since we haven't tried to write to any address in the new range.
Second, we create another anonymous private map. This one will be 32 kB, and we will write to all of it; forcing the kernel to provide pages in RAM for us.
Then, we create a 64 kB anonymous map and initialize it with data. This mapping will be marked as shareable, via MAP_SHARED.
Finally, we open a 256 kB file in read/write mode, and map it into memory space. For fun, we'll go ahead and write all over the first half of the file (128 kB), to force the kernel to retrieve pages from disk and stick them in RAM.
Here's what diag has to say:
[mmap]:
private - [clean] 272.0 k [dirty]
shared - [clean] - [dirty]
272 kB works out to exactly 48 + 32 + 64 + 128. Recall that our baseline null
program had 48 kB of mmap'd memory, probably from glibc or the C runtime itself. Add the 32 kB anonymous private map, the 64 kB anonymous shared map and half (128 kB) of the file map, and you get 272 kB!
How the kernel handles the file-backed memory map is interesting, and speaks to the power of mmap(2). Since the process only wrote to the first 128 kB of the file, the kernel only had to allocate memory for that section. This also proves that the smaps file only reports pages that are actually in RAM.
So, [heap] is for malloc() and [mmap] is for mmap().
If only it were that simple.
If you check the man page for malloc(3), you'll find this nifty little gem in the NOTES section:
Normally, malloc() allocates memory from the heap, and adjusts the size
of the heap as required, using sbrk(2). When allocating blocks of memory
larger than MMAP_THRESHOLD bytes, the glibc malloc() implementation
allocates the memory as a private anonymous mapping using mmap(2).
MMAP_THRESHOLD is 128 kB by default, but is adjustable using mallopt(3).
So, if you try to heap-allocate more than 128 kB (using the default MMAP_THRESHOLD), malloc() sneakily calls mmap() to do the heavy lifting. Normally this is totally fine and transparent, but it does change the way memory is reported.
To illustrate, here's a test case that allocates two buffers, one under the 128 kB threshold, and one over:
#include "lib.c"
int main(int argc, char **argv)
{
char *under = malloc(96 KB);
randomize(under, 96 KB);
char *over = malloc(256 KB);
randomize(over, 256 KB);
return interlude();
}
Normally, we would expect this to report ~352 kB (96 kB + 256 kB) in the [heap] section.
[heap]:
private - [clean] 100.0 k [dirty]
shared - [clean] - [dirty]
[mmap]:
private - [clean] 308.0 k [dirty]
shared - [clean] - [dirty]
You can plainly see that the 96 kB allocation was done in [heap], while the 256 kB passed the MMAP_THRESHOLD and was mmap'd. The 308 kB number comes from our baseline mmap of 52 kB + 256 kB from the malloc() call.
Hopefully you've learned a bit about how memory is organized inside a running Linux process, and how the kernel handles the different allocation methods and process forking.
One parting trick for using the data in smaps to your advantage:
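Since Pss (the proportional set size) already splits each shared page evenly among the processes sharing it, summing the Pss fields gives you a fair per-process footprint, one whose totals add up across processes to (roughly) the real RAM in use. Here's a minimal sketch of that idea; the error handling is bare-bones:

```c
#include <stdio.h>

/* Sum the Pss: lines in /proc/<pid>/smaps for a fair memory footprint. */
int main(int argc, char **argv)
{
    char path[64], line[256];
    unsigned long kb, total = 0;
    FILE *f;

    if (argc != 2) {
        fprintf(stderr, "usage: %s PID\n", argv[0]);
        return 1;
    }

    snprintf(path, sizeof(path), "/proc/%s/smaps", argv[1]);
    f = fopen(path, "r");
    if (!f) {
        perror(path);
        return 1;
    }

    while (fgets(line, sizeof(line), f))
        if (sscanf(line, "Pss: %lu kB", &kb) == 1)
            total += kb;

    fclose(f);
    printf("%lu kB (proportional set size)\n", total);
    return 0;
}
```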
Happy Hacking!
]]>A few weeks ago, I ran into an interesting bug in some production daemon code that was being caused by weird network behavior.
This is what was happening on the wire (courtesy of tcpdump). Addresses and ports have been changed to protect the guilty, and some whitespace added to improve readability.
1 0.000000 10.0.1.8 -> 10.0.0.5 TCP 74 33892 > 1234 [SYN] Seq=0 Win=5840 Len=0
2 0.000004 10.0.0.5 -> 10.0.1.8 TCP 74 1234 > 33892 [SYN, ACK] Seq=0 Ack=1 Win=5792 Len=0
3 0.000466 10.0.1.8 -> 10.0.0.5 TCP 66 33892 > 1234 [ACK] Seq=1 Ack=1 Win=5840 Len=0
4 2.057643 10.0.1.8 -> 10.0.0.5 TCP 74 33894 > 1234 [SYN] Seq=0 Win=5840 Len=0
5 2.057656 10.0.0.5 -> 10.0.1.8 TCP 74 1234 > 33894 [SYN, ACK] Seq=0 Ack=1 Win=5792 Len=0
6 2.058656 10.0.1.8 -> 10.0.0.5 TCP 66 33894 > 1234 [ACK] Seq=1 Ack=1 Win=5840 Len=0
7 15.614608 10.0.1.8 -> 10.0.0.5 TCP 74 33899 > 1234 [SYN] Seq=0 Win=5840 Len=0
8 15.614615 10.0.0.5 -> 10.0.1.8 TCP 74 1234 > 33899 [SYN, ACK] Seq=0 Ack=1 Win=5792 Len=0
9 15.614991 10.0.1.8 -> 10.0.0.5 TCP 74 33899 > 1234 [ACK] Seq=1 Ack=1 Win=5840 Len=0
That pattern continues. A SYN packet from the client, the SYN+ACK response, and the final ACK to nail up the connection, and then... nothing.
The upshot for the server is that the finite number of connection slots in the daemon's memory structures filled up over time with essentially dead connections from hosts like this one. The fix was easy — add a timer to each slot and eject clients that tarry for too long without sending us any real data.
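In sketch form, the fix looks something like this (the slot structure, names and TIMEOUT value are illustrative, not the daemon's actual code):

```c
#include <time.h>
#include <unistd.h>

#define NSLOTS  256
#define TIMEOUT 30   /* seconds a client may sit silent before ejection */

struct slot {
    int    fd;         /* -1 when the slot is free                      */
    time_t last_seen;  /* stamped every time the client sends real data */
};

static struct slot slots[NSLOTS];

/* Called periodically from the main event loop: close and reclaim any
 * connection that hasn't sent data within TIMEOUT seconds. */
static void reap_idle_clients(void)
{
    time_t now = time(NULL);
    int i;

    for (i = 0; i < NSLOTS; i++) {
        if (slots[i].fd >= 0 && now - slots[i].last_seen > TIMEOUT) {
            close(slots[i].fd);
            slots[i].fd = -1;
        }
    }
}
```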
But that's not the point of this post.
The interesting part of the whole ordeal was reproducing the network issue post-incident, to ensure that my proposed fix to the daemon connection-handling logic would withstand hours of such misbehavior.
The problem is we don't know exactly what happened. Graphs of per-process open file descriptor usage helped us to pinpoint when the daemon started leaking fds. We think we know why it happened. We know that a hypervisor / virtual switch upgrade was taking place around the same time as the ramp-up period. We also know that we had seemingly unrelated network outages in backbone connectivity to other data centers.
Literally all we have to go on is the tcpdump we took while things were broken. The challenge was to introduce network outage in a controlled fashion, inside of our development infrastructure.
Naturally, I turned to iptables, which can do more than just keep out intruders. As a consummate inspector of packets, it can be used to emulate specific network problems, with a surprising degree of accuracy.
For starters, here's the shell script I start all iptables experimentation with:
#!/bin/sh
IPTABLES="/usr/bin/sudo /sbin/iptables"
$IPTABLES -F
# rules go here
$IPTABLES -L -nv
echo
echo "The firewall is up. Ctrl-C if want to keep it up;"
echo "otherwise it will be automatically dropped in 4 minutes."
sleep 240
echo "Dropping firewall (failsafe)"
$IPTABLES -F
echo "Dropped."
Whenever I deal with firewalls, especially on virtual machines and remote physical servers where it can be difficult to get a non-networked console, I build in a deadman switch. After four minutes, without human intervention, the nascent firewall will be dropped. This ensures that if I do bork the connection all up (a distinct possibility), the box will "fix" itself after some time. Four minutes is enough time to switch to another terminal and verify that I can get back in over SSH.
In my case, my dev environment is set up with lots of other machines that are constantly connecting to my dev server, just like it was a prod service. As such, I won't be going into generating network load or anything like that.
We'll start off with (incorrect) rules that outright block a single host in our dev network, and verify that everything else is kosher:
VICTIM=10.44.0.6
PORT=1234
$IPTABLES -A INPUT -p tcp --dport $PORT --src $VICTIM -j DROP
That works (except that it doesn't). If we raise that firewall (and Ctrl-C to make it permanent), we should see that connections from our hapless victim at 10.44.0.6 are being silently dropped; no RST packet will be sent back.
Cool. But that's not the observed behavior. We were seeing the SYN / SYN+ACK / ACK three-way handshake complete; we just weren't seeing any data packets. Enter the --tcp-flags option.
Activated whenever you set the protocol match to TCP (via -p tcp), this handy option lets you accept or reject packets based on the TCP flags set in each. It takes two arguments. The first is the set of flags you want to consider, given by their symbolic names (SYN, ACK, URG, PSH, etc.). The second is the subset of those that must be set for the rule to match.
The canonical example (and about the only one you'll find by googling it) is BADFLAGS. Some flag combinations are nonsensical. Take SYN+FIN and SYN+RST, for example: "set me up a connection that should be torn down".
Here's the full rule:
iptables -N BADFLAGS
iptables -A BADFLAGS -j LOG --log-prefix "BADFLAGS: "
iptables -A BADFLAGS -j DROP
iptables -N TCP_FLAGS
iptables -A TCP_FLAGS -p tcp --tcp-flags ACK,FIN FIN -j BADFLAGS
iptables -A TCP_FLAGS -p tcp --tcp-flags ACK,PSH PSH -j BADFLAGS
iptables -A TCP_FLAGS -p tcp --tcp-flags ACK,URG URG -j BADFLAGS
iptables -A TCP_FLAGS -p tcp --tcp-flags FIN,RST FIN,RST -j BADFLAGS
iptables -A TCP_FLAGS -p tcp --tcp-flags SYN,FIN SYN,FIN -j BADFLAGS
iptables -A TCP_FLAGS -p tcp --tcp-flags SYN,RST SYN,RST -j BADFLAGS
iptables -A TCP_FLAGS -p tcp --tcp-flags ALL ALL -j BADFLAGS
iptables -A TCP_FLAGS -p tcp --tcp-flags ALL NONE -j BADFLAGS
iptables -A TCP_FLAGS -p tcp --tcp-flags ALL FIN,PSH,URG -j BADFLAGS
iptables -A TCP_FLAGS -p tcp --tcp-flags ALL SYN,FIN,PSH,URG -j BADFLAGS
iptables -A TCP_FLAGS -p tcp --tcp-flags ALL SYN,RST,ACK,FIN,URG -j BADFLAGS
That's not how we want to use this option, though. What we need to do is allow SYN, SYN+ACK and ACK, but disallow most everything else (including the FIN packet and any data packets).
More formally, we should:

- accept any packet with only the SYN flag set
- accept any packet with only the ACK flag set
- drop everything else from the victim host

This translates quite nicely into our iptables calls:
VICTIM=10.44.0.6
PORT=1234
$IPTABLES -A INPUT -p tcp --dport $PORT --src $VICTIM --tcp-flags ALL SYN -j ACCEPT
$IPTABLES -A INPUT -p tcp --dport $PORT --src $VICTIM --tcp-flags ALL ACK -j ACCEPT
$IPTABLES -A INPUT -p tcp --dport $PORT --src $VICTIM -j DROP
Perfect!
If we run a tcpdump with this firewall up, we should see the same SYN
/ SYN+ACK
/ ACK
pattern we saw from our production packet capture.
iptables is a flexible filter that operates on packets. When combined with the insight gained from tcpdump, you can turn it into a powerful development tool.
]]>A few years ago, I picked up a handful of Linksys OpenWRT routers on ebay for a song. I made one into a wireless access point. Another formed the backbone network switch which provided VPN connectivity from living room to server room and the third was a backup for the second.
Then I tried upgrading them.
I'd rather not speculate as to what happened or why, but suffice it to say I had to buy a cheap D-link to get back on the Interwebs.
This, then, is the saga of resurrecting these devices from the brickyard, and it all starts with the need for a console. A serial console.
Turns out, the WRT devices don't have consoles. But they do have the unpopulated holes in the PCB for hooking up a serial port, which is just as good (with some soldering and a certain level of fearlessness).
(Note: There are much better teardown sites out there, like this one, and this one. I'm publishing this bit here primarily for my own recollection, later)
The router case is three separate pieces of plastic. The purplish-blue front piece, at least in the WRT54Gv4 and GLv1.1 units I have, is just compression fit with two tabs on the underside. It comes off with a modest amount of force. I find it helpful for new units that have never been opened to start with the corner on the left (when the unit is on its back).
If you look closely, you can see the bottom plate of the main housing separating slightly from the top (which is, confusingly enough, on the bottom, since the unit is upside-down).
With a little persuasion, we can get those two separated.
Two screws hold the motherboard into the three slide clips. With those out, you can slide the motherboard down (towards you), and the board should come free. Time to solder!
So Linksys didn't package an RS232 DB-9 serial port on the home router. You can't fault them for not wanting to confuse the unwashed masses. To their credit, they did leave us 10 pinholes for two TTL serial connections!
Really, there are two separate connections; the odd-numbered pins 1, 3, 5, 7 and 9 (the bottom row in the above image) form one, and the even pins form another.
First things first, let's get some headers on those empty holes!
I picked these up from Mouser for next to nothing (plus shipping). I didn't realize that they would come in two- and three-header groups. I was kind of expecting one big run that I would have to break apart myself.
My wife was volunteered as my lovely assistant for this next bit. She steadied the motherboard, and held the headers in place from underneath with a pair of flat-nosed pliers. The components on the motherboard make it impossible to use the work surface to keep the pins in the holes for soldering, so another pair of hands was most useful.
I used a 40-watt soldering iron and some 63/37, .015" silver-bearing solder. I find that smaller diameters give better feed control. Too much solder is often worse than not enough!
The second row went in easier than the first; I was able to use a small binder clip to clamp the loose headers to the other row. This made the piece a lot easier to move around to get the best angle on soldering.
And, for those of you who may doubt my mad l33t soldering skills in the future, here's the finished product, which looks pretty good if I do say so myself:
Next up, it's time to cut a hole in the case somewhere and mount the DB-9 connector and the MAX3232 upconverter chip+board. Did I mention that this is a permanent mod?
I spent hours trawling the web looking for ideas and options for mounting the serial ports. One guide mounted it on the passive air vent on top of the unit, but since I hope to stack my units, this is less than ideal. Another option was to remove the faceplate in the front and mount it there. Other people avoided the entire problem of where to put the port by just drilling a hole and running a serial cable through that.
In the end, I decided I wanted an accessible, yet unobtrusive flush-mount, on the side. Partially, this was so that I didn't ruin the aesthetic appeal of the case too much. Also, it was the only option that didn't make it difficult-slash-impossible to take the case apart again.
Most people use a Dremel multi-tool, or some other power grinder/cutting utility. I chose to cut the hole in my case by hand. While it is definitely faster with power tools, nothing beats the level of control that doing it manually gets you.
To start out, we first have to cut a hole in the side of the case. I used an old quick-change drill bit that fit into a hex-mount screwdriver. The first step is to scratch a small divot in the center of the hole:
This gives the drill bit a place to rest when it starts to cut into the plastic. From here, start to slowly drill into the side of the case. At first nothing much will happen. Soon, the bit will catch and you'll start to make progress:
With the pilot hole drilled clean through, it's time to bring out the files:
I picked these up years ago for under $10 at the local hardware store, ostensibly to do woodworking. While they're much too small for that, they work wonders on projects like this. I used three types: a flat file for removing lots of material quickly, a square file for cutting in the corners and screw-mount holes and a round file for finishing.
With the square file, I cut in some corners, turning the round hole left by the drill bit into a square hole just large enough to get the flat file in:
Then, I switched to the flat file to widen the hole across the depth of the case ("up" in the picture) until I got the rough size of the cut-out.
Once I was satisfied with the width of the hole, I started to expand it in the other direction, still with the flat file.
It's a good idea to periodically check the cut-out against a serial cable, since the cowl will need to fit inside the hole.
Keep expanding the cut-out until the cable fits juuust right.
If you position the RS232 port where it is supposed to be mounted, you'll notice that we forgot the screw-holds! Switching to the square file, held on its point (more like a diamond, really), I cut the channels that they will fit into.
From here on out its just a matter of trying to dry-fit the port + serial cable, figure out what's impeding the connection and filing that bit down, until it all just fits.
Once you're happy with the cut-out and the fit, it's time to warm up the hot glue gun and make it all real permanent.
First, however, I want to take a moment and talk about hot glue.
It's awesome.
I was skeptical at first. I toyed with the idea of running bolts through the case to steady the rear of the PCB that the DB-9 is mounted on. I tried to figure out if it was possible to be precise enough to put the screw-holds through the case to stabilize the port. I mean, how could craft glue really hold up to the stress of plugging and unplugging serial cables?
Oh it holds.
While I was taking the pictures for this article, I put too much hot glue under the chip and it squeezed out through the bottom of the cutout, between the case and the port. This made it impossible to get a good contact on the cable.
Simple enough to fix, right? Just pull the (now sacrificial) chip off the case and glue another one in. Yeah right. Getting the chip off of the hardened glue destroyed the PCB. On top of that, I had to literally chisel the glue off of the case so I could try again.
Hot glue is awesome. It's like wielding a gun full of liquid plastic. It's like a non-conductive solder. And it doesn't get hot enough to damage electronic components.
Okay, now that you're convinced, let's start putting down some glue.
First, you're going to want to focus the glue on the end of the PCB away from the DB-9 port. Otherwise, you run the risk of ruining the whole project with some stray polymer. I strongly recommend plugging in a serial cable into the port, through the case. This will help to stabilize the card, but it also positions the connector with ample clearance on the top and sides. It's no use putting in a serial port you can't plug into!
Let the glue dry. You'll know it's done when the exterior of the case, under the RS232 board, is no longer warm to the touch. Also, the chip won't move. Seriously. Try moving it. Plug that serial cable in. Do it angry! Get rough with it!!
All parts are assembled, all connections soldered, all cases modified and all chips sufficiently glued. Time to wire it up.
Since we soldered headers onto the motherboard, and the RS232 came with headers pre-assembled, we need female-to-female jumper wires, preferably all different colors. That means it's off to Mouser again!
First, make sure the unit is completely unplugged. If you've got the serial cable in unplug that too, just to be safe.
The wiring is pretty straightforward, we want to wire voltage source (Vcc) to voltage source (also, Vcc) and ground (GND) to ground (again, GND). That leaves the Rx and Tx lines. These need to be crossed, so that the Rx pin on the motherboard connects to the Tx pin on the RS232 board, and similarly, the Tx pin router-side should connect to the Rx pin on the serial port:
Or, if you prefer tables:
JP2 | ← jumper → | RS232
---|---|---
pin 1 | not connected |
pin 2 | red | Vcc
pin 3 | not connected |
pin 4 | white | Rx
pin 5 | not connected |
pin 6 | green | Tx
pin 7 | not connected |
pin 8 | not connected |
pin 9 | not connected |
pin 10 | black | GND
It is vitally important that you get the wiring right. If you don't, bad things of varying degrees of you're-screwed-itude can happen: anything from a silent, unresponsive serial port (Rx and Tx swapped) to letting the magic smoke out of the board (Vcc and GND swapped).
At this point, throw in some wire ties to keep things tidy and you can put the case back together. The hardware part is all done!
(Strictly speaking, you really should be testing throughout the build process. As soon as I got my hands on the cable and the components, I wired it all up to verify that I had the pinouts right and that I knew how to actually use the serial port. I put the testing section last so that it was easier to find, and all in one place.)
All done with the hardware and ready to start using it? Great. Let's talk RS232.
Fun Fact: Did you know that the RS in RS232 stands for Recommended Standard? I sure didn't! Now you can wow your friends at parties!
In order to talk to our router over its shiny new serial port, you'll need the following things:
While laptops and desktops long ago dropped their vestigial DB-9 serial ports, everyone has a USB port or four. Luckily, you can pick up a USB/DB-9 serial cable for pretty cheap. Here's the one I bought off of Monoprice:
As for a serial emulation program, I tend to use picocom
because it is simple and completely command-line driven. minicom
also works, although it needs a little extra de-configuration to disable its dialing tendencies.
With those two requirements well in hand, let's turn to the last: Configuration.
With serial links, there are five (5) things you need to know: the baud rate, the byte (data bit) size, the parity scheme, the number of stop bits, and the flow control discipline.
For our purposes (and for configuring serial links to most modern devices), the answers to those questions are: 115200 baud, 8-bit bytes, no parity, 1 stop bit, and no flow control.
The combination of byte size, parity and number of stop bits is usually abbreviated into a three-character designation, like 8N1. The first position indicates byte size. You may see references to 7N1, which is just like 8N1 except that it only counts 7 bits to the byte. The second position indicates parity, either even (E), odd (O) or none (N). The third position identifies the number of stop bits, which in our case is 1.
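Incidentally, this 115200-8N1-no-flow-control setup is exactly what terminal programs configure via termios(3). Here's a sketch of doing it by hand in C; open_serial is my own name for it, and cfmakeraw/CRTSCTS assume a glibc system with _DEFAULT_SOURCE:

```c
#define _DEFAULT_SOURCE   /* for cfmakeraw() and CRTSCTS on glibc */
#include <fcntl.h>
#include <termios.h>
#include <unistd.h>

/* Open a serial device and configure it for 115200 8N1, no flow control. */
int open_serial(const char *dev)
{
    int fd = open(dev, O_RDWR | O_NOCTTY);
    if (fd < 0) return -1;

    struct termios tio;
    if (tcgetattr(fd, &tio) != 0) { close(fd); return -1; }

    cfmakeraw(&tio);                    /* raw bytes, no line editing     */
    cfsetispeed(&tio, B115200);         /* 115200 baud, both directions   */
    cfsetospeed(&tio, B115200);
    tio.c_cflag &= ~(PARENB | CSTOPB);  /* no parity, 1 stop bit (the N1) */
    tio.c_cflag &= ~CSIZE;
    tio.c_cflag |= CS8;                 /* 8-bit bytes (the 8 in 8N1)     */
    tio.c_cflag &= ~CRTSCTS;            /* no hardware flow control       */

    if (tcsetattr(fd, TCSANOW, &tio) != 0) { close(fd); return -1; }
    return fd;
}
```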
So, armed with all of this newfound knowledge about ancient communication protocols, let's plug into the router serial port and fire up picocom:
$ picocom -b 115200 -e x -p n -d 8 /dev/ttyUSB0
Terminal ready
CFE version 1.0.37 for BCM947XX (32bit,SP,LE)
Build Date: Thu May 26 10:55:05 CST 2005 (root@localhost.localdomain)
Copyright (C) 2000,2001,2002,2003 Broadcom Corporation.
Initializing Arena
Initializing Devices.
No DPN
et0: Broadcom BCM47xx 10/100 Mbps Ethernet Controller 3.90.37.0
CPU type 0x29008: 200MHz
Total memory: 16384 KBytes
Total memory used by CFE: 0x80300000 - 0x803A3620 (669216)
Initialized Data: 0x80339530 - 0x8033BC40 (10000)
BSS Area: 0x8033BC40 - 0x8033D620 (6624)
Local Heap: 0x8033D620 - 0x803A1620 (409600)
Stack Area: 0x803A1620 - 0x803A3620 (8192)
Text (code) segment: 0x80300000 - 0x80339530 (234800)
Boot area (physical): 0x003A4000 - 0x003E4000
Relocation Factor: I:00000000 - D:00000000
Boot version: v3.6
The boot is CFE
... etc ...
It works! The Terminal ready line at the top is from picocom itself, but everything else is from the WRT router and its CFE bootloader.
I've got a lot more to say on the subject of what you can do with a serial port on a WRT device, but this piece is drawing to a close. Keep an eye out for future posts where I hope to go over the boot process in excruciating detail, show how to get in and play around with the bootloader itself, and walk through flashing the device directly via serial+tftp. Oh, and we'll get to build our own OpenWRT images!
Happy Hacking!
]]>I tried to give Automake a fair shake, I really did. But after trying it on a half-dozen non-trivial compiled projects, I have to conclude that it is more trouble than it's worth.
Why?
Automake purports to ease the burden of writing those awful makefiles. It does so by constraining you to a very specific software build discipline, rounding off all those sharp edges so that you can't accidentally hurt yourself. This saves some time, but only in one phase of the project: the beginning. With the full Autotools stack, it is almost trivial to set up a new project.
(Note: I'm talking exclusively about software projects written in compiled languages like C. I have never tried to use Automake for anything else, but I imagine it's just as worthless.)
If you go the Automake route, you get to forget you ever saw a Makefile, and go about your day writing code for The Next Big Thing ™. Sounds great! So what's wrong with Automake again?
Let's put it a different way.
Code is hard, right? Put aside any language bias for a moment — I'm not talking about the horrors of C memory management or the problems of JVM performance. I mean that code is hard. All the high-level, garbage-collected, pointer-not-having language features in the world won't change that fundamental fact that it takes a certain mind- and skillset to decompose real-world problems into multi-layered abstractions and model them in ones and zeros.
Agreed? Okay. Moving on.
Since code is so hard, let us solve that problem. Why don't we promote more re-use? What if we could promote re-use across language barriers? What if we could promote re-use across time? If we settle on a single calling convention for bits of logic, and a standard, bi-directional transport for passing messages into and out of those bits of logic, we could make coding a ton easier.
We have that. We call it the Unix philosophy. The calling convention is execve(2). The message bus is pipe(2).
I'll just wait here while the entire world of enterprise software development, all Open Source projects everywhere, and every startup ever rewrites everything in bash.
...
Why isn't anyone throwing out their library of Java programming books for a copy of O'Reilly's LTBS3e? Because it's ludicrous to think that bash is the ultimate programming environment because it hides complexity behind a few pipes and some execs.
Let's take a different tack.
Hell, let's switch industries for a minute.
A charcutier (one who practices charcuterie) is a butcher who deals with prepared meats like patés, bacon and sausages. It's an exacting profession requiring a depth and breadth of knowledge that blends art with science — not unlike what a hacker does with bits and bytes. Between curing, fermentation, brining, emulsification and seasoning, the charcutier has a number of highly specialized and demanding tools at her disposal.
And yet the charcutier still grinds meat to make sausage.
Don't skip the fundamentals, no matter how unpleasant they may seem. Grinding meat, like writing Makefiles, is an essential step in their respective processes, and skipping them is a disservice to their crafts.
]]>Today, the not-invented-here department brings you the following gems.
Dulwich is the answer to the question: what kinda works like git, but lags behind it in bug fixes and feature requests?
That's right folks. Torvalds' own version control system, the software that powers the most massive collaborative Open Source development effort, just isn't good enough for Pythonistas. Because it's written in C.
Nagios (or Icinga, for those of you who keep up with / care about that sort of thing) is perfect, except for that pesky legacy as a system that was kicking ass before Python 2.0 was released. Someone decided to rewrite the entire Nagios core (bad decisions and all) in Python.
Seriously, this is almost as bad as the Java guys and JeroMQ.
]]>with apologies to Walt Whitman
O Sandal! my Sandal! the summers we have fared,
Your sole is cracked from weathering, your tender core laid bare,
The time is near, thy knell I hear, my friends and family cheering,
While follow I the steady thong, thy strength alight in song.

But O heart! heart! heart!
O the footprints so well tread,
Where on the deck my sandals lie,
Fallen cold and dead.

O Sandal! my sandal! Would that thou saw the field;
Rise up—for you the sun does shine—for you the laughter pealed.
For you fresh air and barbecues—for you the night's carousing,
For you I called, springtime warming, eager toes a-stretching,

Here Sandal! dear Sandal!
My hand beneath your heel!
It is some dream that on the deck,
You've fallen, cold and dead.

My Sandal does not answer, its sole is cracked and still,
My Sandal does not feel my hand, it has no pulse nor will,
With spring's blooms here—summer's close behind,
From tearful trip I come, for new footwear to find;

Exult O trees, and sing O birds!
But I with mournful tread,
Walk the deck my Sandals lie,
Fallen, cold and dead.
Sandals
June 4th, 2004 – May 18th 2014
May they rest in pieces.
ctap is an easy way to get Perl-style TAP testing convenience in C. It ships as a standalone shared library that you can link to your tests, and a header file that contains functions and macros for doing things like assertions, skip/todo blocks and dynamic evaluation.
Check it out on GitHub: http://github.com/jhunt/ctap
ctap stays out of your way, letting you focus on writing tests:
#include <ctap.h>

tests {
    ok(1 == 1, "1 does in fact equal 1");
}
When run, this will output:
ok 1 - 1 does in fact equal 1
1..1
This is TAP, so you can use prove and its -v option to control output:
$ prove t/01-sample
t/01-sample .. ok
All tests successful.
Files=1, Tests=1, 0 wallclock secs ( 0.02 usr + 0.00 sys = 0.02 CPU)
Result: PASS
$ prove -v t/01-sample
t/01-sample ..
ok 1 - 1 does in fact equal 1
1..1
ok
All tests successful.
Files=1, Tests=1, 0 wallclock secs ( 0.01 usr + 0.01 sys = 0.02 CPU)
Result: PASS
Here's a more complicated example, using some fancier and more well-to-do assertions like is_string and isnt_null:
#include <ctap.h>

tests {
    char *s = "a string";

    is_null(NULL, "NULL is a null pointer");
    isnt_null(s, "s is not null");
    is_string(s, "a string", "s is 'a string'");
    is_string(s, "empty", "s is not 'empty'");
}
And here is the output:
ok 1 - NULL is a null pointer
ok 2 - s is not null
ok 3 - s is 'a string'
not ok 4 - s is not 'empty'
# Failed test 's is not 'empty''
# at ./t/01-sample.c line 8.
# got: 'a string'
# expected: 'empty'
1..4
# Looks like you failed 1 test of 4.
Building libctap.so
-------------------
If you've cloned from the upstream git repo, you'll want to bootstrap:
$ autoreconf -vi
To build, follow the standard process:
$ ./configure
$ make
$ sudo make install
If you want to hack on ctap, don't forget to rebuild all of the autotools files (when you make changes to Makefile.am, configure.ac and friends) via autoreconf.
Linking with libctap
--------------------
It couldn't be easier. Build each test as a standalone executable, and link them with LDFLAGS of -lctap:
$ gcc -c -o t/01-sample.o t/01-sample.c
$ gcc -lctap -o t/01-sample.t t/01-sample.o
$ prove
Assertions, Assertions, Assertions ----------------------------------
ok() is the most basic of assertions, and also one of the most flexible.

ok(x == 3, "x was three");
ok(sqrt(y) == 2, "sqrt(y) is 2 (y is %d)", y);
You can write all of your tests using nothing but ok(), but you may want to look at some of the more advanced assertions that give better failure diagnostics.
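For a taste of the difference, here's the same check written two ways, using only assertions documented in this post. When it fails, the ok() version reports only that it failed, while the is() version also diag()s what it got and what it expected:

#include <ctap.h>
#include <string.h>

tests {
    const char *name = "libctap";

    /* on failure: just "not ok 1 - name is ctap" */
    ok(strcmp(name, "ctap") == 0, "name is ctap");

    /* on failure: also prints the got/expected diagnostics */
    is(name, "ctap", "name is ctap");
}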
is() asserts that strings a and b are equivalent, even if they point to different memory regions:
is(name, "ctap", "Library name should be ctap");
If either value is NULL, the assertion will fail, since null strings are not logically equivalent.
Strings are taken as standard C-style, NULL-terminated strings. If your strings are not terminated, the comparison can easily read past the end of the buffer.
isnt() asserts that strings a and b are different:
isnt(errstr, "Failed to read data", "errstr isnt a read-fail");
If either value is NULL, the assertion will fail.
Strings are taken as standard C-style, NULL-terminated strings. If your strings are not terminated, the comparison can easily read past the end of the buffer.
cmp_ok() compares two values using arbitrary operators. This is slightly more useful than plain old ok(), because it will print both values and the operator via diag() when it fails.
cmp_ok(x(), "!=", y(), "x() and y() are different values");
cmp_ok(f(g()), "==", g(f()), "f() and g() are composable");
pass() unconditionally passes. The following are equivalent:
pass("works for me!");
ok(1, "works for me!");
fail() unconditionally fails. The following are equivalent:
fail("broken");
ok(0, "broken");
Diagnostics and Notations -------------------------
diag() prints a diagnostic message. This message will not interfere with the test output.
diag("sleeping for up to %d seconds", timeout);
Running under prove (without -v) will suppress all output from diag().
note() works like diag(), except that prove will display the message in both verbose and normal mode (with and without -v).
note("testing %s v%s", PACKAGE, PACKAGE_VERSION);
Skip and Todo Blocks --------------------
ctap supports skip and todo blocks, via the SKIP and TODO macros:
SKIP("not ready for primetime") {
ok(experimental_function(), "should be ok");
}
TODO("api-internals are still under heavy rework") {
ok(api_internals(), "should work fine");
}
All tests in a SKIP block will still be run, but ctap will pretend as if they had implicitly succeeded. In a TODO block, tests can fail, but they will not count against the test suite as a normal failure would.
Note: these macros are themselves experimental, and their interfaces may change in future versions of libctap. Do not rely on them in real test suites.
Today I release arpscan into the wild, as an Open Source project.
ARP (RFC 826) is the Address Resolution Protocol, used by machines to find other machines, normally for the purpose of sending them packets (like TCP/IP packets or UDP datagrams).
Ethernet has no idea about routing, or even what an IPv4 address is. All it knows about is the data link and MAC addresses. Ethernet frames have two such addresses, the source MAC (known to the host, from the egress interface) and the destination MAC. Since keeping static tables of MAC addresses for all hosts on a LAN is even less maintainable than /etc/hosts was before DNS, ARP was born.
For our purposes, ARP consists of only two operations: requests and replies. A request is sent out onto the segment to determine what MAC address owns a specific protocol address. A reply is sent from the owning host, answering that query.
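If you're curious what's actually inside those requests and replies, the whole RFC 826 packet body (for Ethernet + IPv4) fits in a small C struct. This is just an illustrative sketch; the field names are mine, not arpscan's:

#include <stdint.h>

/* ARP packet body for Ethernet + IPv4 (RFC 826); all fields in network byte order */
struct arp_pkt {
    uint16_t htype;   /* hardware type: 1 = Ethernet */
    uint16_t ptype;   /* protocol type: 0x0800 = IPv4 */
    uint8_t  hlen;    /* hardware address length: 6 */
    uint8_t  plen;    /* protocol address length: 4 */
    uint16_t oper;    /* operation: 1 = request, 2 = reply */
    uint8_t  sha[6];  /* sender hardware (MAC) address */
    uint8_t  spa[4];  /* sender protocol (IPv4) address */
    uint8_t  tha[6];  /* target hardware address (ignored in requests) */
    uint8_t  tpa[4];  /* target protocol address (the one being asked about) */
} __attribute__((packed));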
Enter arpscan.

arpscan sends a flurry of ARP requests out onto the local segment, probing to see who (if anyone) owns each and every IP address in a range like 10.15.0.0/24. As ARP replies come in from the other hosts, MAC → IPv4 address associations are printed to standard output.
The beauty of this type of scan is pretty simple. ARP requests cannot be blocked; it makes little sense to be on a network but refuse to tell anyone where you really are. Without ARP, no packets would get to you. On top of that, it is fast. Faster than a ping loop; faster than nmap. And unlike arping and friends, arpscan bypasses the system ARP cache entirely. If you've ever scanned a large network (10.0.0.0/8 springs to mind) with arping, you know why this is desirable.
Here are some examples, to whet the appetite:
# arpscan -d wlan0 -n 10.0.0.0/24
68:7f:74:b4:53:aa 10.0.0.1
00:30:67:69:12:f9 10.0.0.15
00:80:77:8d:d6:7a 10.0.0.92
3c:43:8e:09:03:4b 10.0.0.93
And here's my virtualized test environment:
# sudo arpscan -d virbr2 -n 10.10.10.0/24 -t 1200
52:54:00:18:40:9b 10.10.10.130
52:54:00:3e:53:5e 10.10.10.129
52:54:00:61:64:db 10.10.10.132
52:54:00:a4:53:c8 10.10.10.131
52:54:00:2d:f9:16 10.10.10.123
52:54:00:2f:a9:77 10.10.10.128
52:54:00:a6:69:3b 10.10.10.124
52:54:00:c5:21:0b 10.10.10.133
aa:89:dc:f1:87:45 10.10.10.126
52:54:00:7c:a8:ae 10.10.10.121
To grab your own copy, download the code from github:
WARNING: arpscan is still very beta software (less than a day old as of this writing). It may not work well on large networks. It may not work well (or at all) on your network. I welcome bug reports and patches, the latter more so than the former.
Happy Hacking!
As far as Greek letters go, λ is a computer science favorite, thanks in no small part to Alonzo Church's λ-calculus.
For reasons now unknown to me, I avoided learning λ-calculus for many years. I guess I thought it was too math-y for the real world of software.
Turns out, λ-calculus is easy.
Here's a name:
$$x$$
This is a function:
$$λx.x$$
or, in Javascript:
function (x) { return x; }
This is the identity function, often called I.
The first \(x\) (after the \(λ\) but before the \(.\)) identifies the argument to the function, which will be bound to the name \(x\). The expressions after the \(.\) are the body of the function.
In λ-calculus, functions aren't called, they are applied. Since Church was interested in modeling computation, you can't have a function without arguments (what can you compute with no input?). Therefore, functions are applied to their arguments, to compute their value.
$$(λx.x)y = y$$
The parentheses are necessary to differentiate the application of \(λx.x\) to \(y\) from the definition of the function \(λx.xy\) (two wholly separate expressions).
In Javascript, you would write:
(function (x) { return x; })(y);
When applying functions, instances of the argument name in the body of the function are replaced with the applied values. In the above case,
$$(λx.x)y ≡ y$$
I.e., if we replace all occurrences of \(x\) in the body of the function \(λx.x\) (which is just \(x\)), we get \(y\).
In functional application, we must differentiate between free and bound variables. Consider the function:
$$λx.xy$$
\(x\) is a bound variable (because it is present in the named arguments of the definition) and \(y\) is a free variable (its value must be defined externally).
If we apply this function to \(z\), we get:
$$(λx.xy)z ≡ zy$$
by replacing all \(x\) with \(z\); \(y\) is unaffected, because it is free.
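This works even when the argument is itself a function; substitute, then keep reducing:

$$(λx.xy)(λz.z) → (λz.z)y → y$$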
Pretty cool, huh?
But Church doesn't stop there. After all, he was looking for a way of modelling computation, not just performing it.
One thing we may want to know when analyzing computation is if one function is equivalent to another function.
Are these two Javascript functions the same? Equivalent?
function add(x, y) { return x + y }
function combine(a, b) { return a + b }
They aren't identical since each has its own unique name, but they do seem to be equivalent. How can you rigorously prove that?
Well, you can't in Javascript, but you can in λ-calculus, via α-conversion (alpha conversion). First, let's define add and combine as λ terms:

$$\text{add} \;→\; λxy.+xy$$
$$\text{combine} \;→\; λab.+ab$$
An α-conversion, or alpha rename, is a systematic process of renaming all bound variables in the function arguments and its body. Through two such conversions, we can convert add into combine, and thus demonstrate α-equivalency:
$$λxy.+xy → λay.+ay → λab.+ab$$
First, we rename the bound variable \(x\) to \(a\), in both the function signature (so that \(λxy\) becomes \(λay\)) and the body (\(+xy\) becomes \(+ay\)), and then we do it again, renaming the bound variable \(y\) to \(b\). At the end, we are left with our combine function. Hence,
$$λxy.+xy ≡ λab.+ab$$
i.e., they are α-equivalent.
Beta reductions sound scarier than they actually are. They are at the heart of function application.
While α-conversion is concerned with substituting one symbolic name for another (e.g. \(a\) for \(x\), \(b\) for \(y\)), β-reduction substitutes values for symbolic names and actually performs the computation.
$$(λxy.+xy)\:3\:4\; → \;+\:3\:4\;→ \;7$$
(assuming that \(+\) denotes addition)
With nested function applications, β-reduction is done once per application. Consider the following (continuing under the assumption of what \(+\) means):
$$(λxy.+\:3\:(λwz.+wz)xy)\:9\:2$$
$$+\:3\:(λwz.+wz)9\:2$$
$$+\:3\:(+\:9\:2)$$
$$+\:3\:(11)$$
$$14$$
See, it's not that bad!
Check out Achim Jung's A short introduction to the Lambda Calculus, as well as A Tutorial Introduction to the Lambda Calculus, by Raúl Rojas.
Loopback devices are really cool.
The concept is simple enough: take a file, and mount it like it was a block device. What good is that? Let me tell you a story.
I'm writing a small web application that accepts file uploads. The web server layer handles incoming uploads by storing them on-disk, in a temporary area like /tmp. The application verifies some pieces of the request (mostly related to database state and API headers) and then tries to move the uploaded file out of /tmp and into /srv/data/foo/bar.
Works great in my local development environment. Fails miserably in production.
Turns out, the last bit of the process uses a UNIX hard link (the kind you get when you forget the -s in ln -s). My dev environment only has one filesystem on one device, so hard-linking from /tmp to /srv is no big deal. Not so in production.
Since I'm not about to re-partition my dev environment, I need a way of testing cross-device uploads. Enter Loopback Devices.
To get started, build out a blank file using the venerable dd utility:

$ dd if=/dev/zero of=disk.dev bs=4096 count=10240

That gives us a 40 MB file of zeroes (4096 bytes × 10240 blocks).
Now, we need to bind a loopback device to our device file. First, ask losetup for the first unused loopback device, then attach our file to it and verify:
$ sudo losetup -f
/dev/loop0
$ sudo losetup /dev/loop0 disk.dev
$ sudo losetup /dev/loop0
/dev/loop0: [0802]:3300012 (/path/to/disk.dev)
Note: your output may vary.
Next up, we need to create a filesystem on our loopback device. I use ext3 because it's easy and well-supported, but you can use whatever you want and/or need.
$ sudo mke2fs -j /dev/loop0
Finally, mount it (in this case, to a new directory):
$ sudo mkdir /mnt/vfs
$ sudo mount /dev/loop0 /mnt/vfs
Armed with this new trick, you can emulate any number of different filesystems, cross-device fragmentation, multi-mount environments and more. With a properly outfitted test environment (a little sudo, filesystem sandboxing, etc.) you can even write automated unit tests that take advantage of this.
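For example, the bug that started this whole adventure pins down nicely in a few lines of C that attempt the cross-device hard link directly. A minimal sketch (the paths assume the /mnt/vfs mount from above):

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    /* fake an "upload" landing in /tmp */
    int fd = open("/tmp/upload.tmp", O_CREAT | O_WRONLY, 0644);
    if (fd < 0) { perror("open"); return 2; }
    close(fd);

    /* try to hard-link it onto the loopback-mounted filesystem */
    if (link("/tmp/upload.tmp", "/mnt/vfs/upload") != 0) {
        if (errno == EXDEV) {
            printf("got EXDEV (cross-device link), just like production\n");
            return 0;
        }
        printf("link failed unexpectedly: %s\n", strerror(errno));
        return 2;
    }

    printf("link succeeded; both paths live on the same filesystem\n");
    return 1;
}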
When you're all done, and want to cleanup, just unmount the device and detach the loopback device:
$ sudo umount /mnt/vfs
$ sudo losetup -d /dev/loop0
Happy Hacking!
Cleaning out my email this morning, I found this little gem:
From: Inna Dumanska <XXXXX@4tegroup.com>
Date: Fri, 8 Feb 2013 10:13:06 -0600
Subject: Appreciate your expertise with this Open Source engineering issue
James,
Hope you have a frightful Friday. I recently reviewed your profile on
StackOverFlow.
Frightful, indeed.
I noticed you are an experienced systems administration and software
development professional and I would greatly appreciate your expertise with
this exciting but difficult-to-fill position I'm working on.
Apparently, being good at something technically qualifies me to do HR work...
Right now we are looking for an exceptional Sr. Systems Engineer in DT
Chicago with a bright software company. It's is one of the largest private
clouds and a global leader in software that processes human information and
unstructured data. The engineer will be working with Puppet, CFEngine, DB2,
MySQL as well as Linux...
DT = downtown? downtime? dance team? david tenant? day tripper?
Also, exceptional people who want to work at a bright software company? Do they also have to be an above average driver?
Hmm. Puppet / Cfengine, DB2 / MySQL. Couldn't decide? Doesn't sound too bright.
James, I do understand you are not a Linux bigot. I would appreciate if you
can just direct me as where to find such professionals as you interested in
Linuxy and for any valuable insights you might want to share.
Thank you. Have a cheerful upcoming weekend.
Inna Dumanska | Technology Recruiter | FORTÉGroup
Okay, this section probably deserves a little context, so here is an excerpt from my StackOverflow C.V.:
I assure you, I am not a Linux bigot. I mean sure, it beats every other OS
hands-down for sheer flexibility, power and stability, but I'm not a Linux
bigot. I would turn down a 500k/year job if it meant managing Windows
servers, but I am not a Linux bigot. Okay, I refuse to deploy web-based
applications on IIS, because Apache is so much better, but I am not a Linux
bigot.
When I wrote it, I felt it was dripping with sarcasm and snarkiness. I guess that was lost on poor Inna.
I've been meaning to work my way back through Knuth's Art of Computer Programming, and maybe even Essentials of Programming Languages. I cracked the digital spine on EOPL3 last night, and realized that doing the exercises was going to require the maths.
Which is no problem, except that I have fallen out of the habit of writing in notebooks. That leaves some digital format, and I immediately thought of my blog! That of course brought up the question at hand: how to display mathematical formulae with beauty and clarity, using only HTML?
$$∑↙{i=0}↖n i={n(n+1)}/2$$
Easy, right?
I'm using a small library called jQMath, that acts like a live, in-browser source filter. It converts tagged regions of text (surrounded by $$) into stylized HTML, like the Sigma summation above.
A few things to note: jqmath wants honest-to-goodness Unicode (UTF-8) input, end to end.

$$∫_Δ d\bo ω = ∫_{∂Δ} \bo ω$$

I stress the Unicode requirement because I spent the better part of the afternoon fighting my tools. To clarify: this is the fault of the code I wrote to run this website, and is not a deficiency in jqmath.
$$(\table \cos θ, - \sin θ; \sin θ, \cos θ)$$
I ended up rewriting John Gruber's markdown tool (the one that you can get from the Ubuntu repos) so that it handled UTF-8 input and produced proper UTF-8 output. Otherwise, you can't do much beyond simple ASCII variables:
$$a^2+b^2=c^2$$
That works without Unicode, but not much else does. And what's the point of having a kick-ass formula renderer if you can't use the really cool math symbols?
$${1+√5}/2=1+1/{1+1/{1+⋯}}$$
As for the setup, all you really have to do is include the jqmath stylesheet and script on your page (I load them at the bottom of <body>, with my other libs).

Happy Hacking!
You need not quote great men to be one - a great man
Wow your friends with this command-line tomfoolery: MySQL has a zombie mode!
It's really easy. First, a little bit about what you get in zombie mode:
In batch mode (-B), MySQL uses a tab to separate columns of resulting data, instead of its fancy pipes-and-plus-signs table renderings.
So, instead of this:
+----+----------+---------------------------------------------------------------------+
| id | name | notes |
+----+----------+---------------------------------------------------------------------+
| 1 | Werewolf | Also called a lycanthrope
Not as cool since the last Twilight movie |
| 2 | Zombie | Ambling undead, of varying speeds see also: zombie process |
| 3 | Vampire | NULL |
+----+----------+---------------------------------------------------------------------+
you'll get this:
id name notes
1 Werewolf Also called a lycanthrope\nNot as cool since the last Twilight movie
2 Zombie Ambling undead, of varying speeds\tsee also: zombie process
3 Vampire NULL
Zombies don't put much stock in presentation. Neither does MySQL, in raw mode (-r). Special characters embedded in the data (like \t and \n) will be rendered literally, without backslash-escaping.
So, instead of this:
id name notes
1 Werewolf Also called a lycanthrope\nNot as cool since the last Twilight movie
2 Zombie Ambling undead, of varying speeds\tsee also: zombie process
3 Vampire NULL
you'll get this:
id name notes
1 Werewolf Also called a lycanthrope
Not as cool since the last Twilight movie
2 Zombie Ambling undead, of varying speeds see also: zombie process
3 Vampire NULL
(note the literal newline in the first record; you know damn well the Zombies will).
Autocompletion of table and column names makes people faster. Zombies amble leisurely through undeath, and don't need to save any time; -A skips the auto-rehash that powers completion.
-i (--ignore-spaces) keeps MySQL from being picky about where you put your whitespace in SQL queries. Zombies apparently can't type.
You know what your data looks like. You don't need column names mucking up the carefully escaped (remember raw mode?) data and forcing you to strip off a line. -N (--skip-column-names) drops them:
1 Werewolf Also called a lycanthrope
Not as cool since the last Twilight movie
2 Zombie Ambling undead, of varying speeds see also: zombie process
3 Vampire NULL
Aside from the occasional gurgle, Zombies are pretty quiet dudes (and dudettes). MySQL can be just as quiet, with the -s (--silent) flag.
Put it all together, and whaddya get?

$ mysql -BrAiNs

BRAINS!
If you have programmed in C for any length of time, you already know that your C code gets compiled down to assembly, then to machine code for whatever platform you are building for. What you may not have realized is that this lets you experiment with assembly by writing and half-compiling C. If you can get the compiler to stop right after it has written the assembly code, you can get in and see how it thought C constructs should be represented at the lowest level. Luckily, GCC's -S option does just that.
This is a very powerful learning tool if used properly. It can lead to a confusing mess if used improperly. Let's lay down some ground rules:
GCC doesn't like you. Sure, it will compile your programs for you, and it rarely complains unless you do it wrong, but first and foremost, GCC is not out to please you; the job of the compiler is to produce machine code, not human-readable assembly. Its second priority is to produce the fastest, most efficient and most secure machine code possible.
Unfortunately, fast, efficient and secure code is not always the most comprehensible. Luckily for GCC, no human ever reads the code it generates. GCC can get away with all kinds of crazy optimizations because 99.9% of the world cares how well the resulting binary runs, not how clean its machine code is.
To illustrate this, let's write the same program twice: once in C, which we will compile down to assembly, and once directly in assembly.
First, the assembly version:
$ cat simple-asm.s
.text
.globl main
main:
    ret
$ gcc simple-asm.s -o simple-asm
Here is the C version:
$ cat simple-c.c
int main(int argc, char **argv, char **env)
{
    return 0;
}
$ gcc simple-c.c -o simple-c
Run both of these programs and verify that they do, well, nothing. The $? shell variable should also be 0 after each binary runs.
Now it's time for some real fun. Let's see how GCC thinks we should write our assembly code to properly implement the C code from simple-c.c.
$ gcc -S simple-c.c -o simple-c.s
The simple-c.s file will contain the generated assembly we are looking for. Here's what I got:
$ cat simple-c.s
    .file   "simple-c.c"
    .text
    .globl  main
    .type   main, @function
main:
.LFB0:
    .cfi_startproc
    pushq   %rbp
    .cfi_def_cfa_offset 16
    .cfi_offset 6, -16
    movq    %rsp, %rbp
    .cfi_def_cfa_register 6
    movl    %edi, -4(%rbp)
    movq    %rsi, -16(%rbp)
    movq    %rdx, -24(%rbp)
    movl    $0, %eax
    popq    %rbp
    .cfi_def_cfa 7, 8
    ret
    .cfi_endproc
.LFE0:
    .size   main, .-main
    .ident  "GCC: (Ubuntu/Linaro 4.6.1-9ubuntu3) 4.6.1"
    .section    .note.GNU-stack,"",@progbits
(due to differences in platforms, architectures and GCC versions, you may get the same thing, or you may not. YMMV)
That's a lot more code. Some of it comes from GCC optimizations, some of it comes from the actual C code we wrote, and some of it is just GCC being GCC.
The .file directive records what C source file this assembly was originally generated from (it ends up in the debugging information). It can be removed (at the cost of debugging ability).
The .globl main statement identifies the main: label as a global symbol that the linker should export. Since we are using GCC to assemble down to machine code, we need to keep this. Otherwise, we'll get linker errors about undefined references to 'main' from the _start code. The .type directive, on the other hand, can be removed safely.
The lines that start with .cfi_ are Call Frame Information directives, and can safely be removed, although I have yet to find a GCC option that will do so.
Everything after .LFE0: can also safely be ignored; these are directives that as and ld use for their own purposes. For example, the .ident directive identifies that I used GCC v4.6.1 on an Ubuntu Linux box to generate the assembly code.
To prove that these parts of the assembly are non-critical, let's remove them, and try to build again:
$ cat bare.s
.text
.globl main
main:
    pushq %rbp
    movq %rsp, %rbp
    movl %edi, -4(%rbp)
    movq %rsi, -16(%rbp)
    movq %rdx, -24(%rbp)
    movl $0, %eax
    popq %rbp
    ret
$ gcc -o bare bare.s
With all that extra cruft out of the way, we can get back to the task at hand: seeing how GCC translated our simple C program into assembly. Let's step through the code, op-code by op-code, to see what is going on.
pushq %rbp
Push the value of the RBP (frame pointer) register onto the stack. The _start harness that GCC links in for us set up RBP before calling main, so we save its value now and restore it (via a matching popq) just before we return. (Note: I did all of this on an x86_64-linux, so pointers are 64-bit values, hence the q in pushq.)
movq %rsp, %rbp
Copy the 64-bit value from the RSP (stack pointer) register into the frame pointer register. Keep in mind that GCC defaults to AT&T assembly syntax, in which the operands are specified source, destination.
movl %edi, -4(%rbp)
movq %rsi, -16(%rbp)
movq %rdx, -24(%rbp)
Copy values from general purpose registers to the stack. EDI houses the 32-bit int argc argument to main, RSI contains the 64-bit pointer char **argv, and RDX contains the 64-bit pointer char **environ.
Always remember what ABI you are compiling against. Since my work was done on x86_64-linux, the calling convention states that register arguments go in %rdi, %rsi, %rdx, %rcx, %r8, and %r9. This is different from the ABI for 32-bit platforms!
movl $0, %eax
Copy the literal value 0 into the 32-bit EAX register; this is the return value of main. Later, we'll modify the C program to prove this out.
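If you'd rather not wait, you can prove it out right now; change the return value in simple-c.c and the register move changes with it:

int main(int argc, char **argv, char **env)
{
    return 42;   /* gcc -S now emits: movl $42, %eax */
}

Running the rebuilt binary and then checking $? prints 42.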
popq %rbp
Reset the frame pointer so we can jump back to the _start harness.
ret
Return back to the calling code.
That's a lot of instructions for a decidedly uninteresting program, which leads me to my next point...
Even without all of the code that GCC adds, the code generated from our C file did a lot more than our hand-written assembly. To see why, let's try re-writing the program to ignore the things we don't need, regenerate the assembly, and see what changed.
Every C programmer has practically memorized the signature for the main function: int main(int argc, char **argv, char **environ). That may be the correct way to define main, but it is not the required way. Since we don't actually use argc, argv or environ in our program, let's try removing them from the signature:
int main(void)
{
    return 0;
}
Translating that into assembly, and removing the GCC extensions, we get the following assembly:
.text
.globl main
main:
    pushq %rbp
    movq %rsp, %rbp
    movl $0, %eax
    popq %rbp
    ret
Better. Without the parameter list in our definition of main, GCC omitted the assembly that interacted with the stack to pull off argument values. In addition, we can start to see what bits of assembly are used to do what. Here's a diff of bare.s and no-sig.s:
--- bare.s 2012-07-09 12:47:55.054788188 -0400
+++ no-sig.s 2012-07-09 12:47:26.530787076 -0400
@@ -3,9 +3,6 @@
main:
pushq %rbp
movq %rsp, %rbp
- movl %edi, -4(%rbp)
- movq %rsi, -16(%rbp)
- movq %rdx, -24(%rbp)
movl $0, %eax
popq %rbp
ret
The only thing that was removed from the assembly by not specifying the parameter list was the code that dealt with the stack. We still have the pushq/popq instructions that deal with the frame pointer register, and the movl instruction for our return value.
What happens if we remove the return statement?
void main(void)
{
    // how very zen...
}
Still valid C. Let's take a look at the assembly:
.text
.globl main
main:
    pushq %rbp
    movq %rsp, %rbp
    popq %rbp
    ret
Success! The movl statement that put the return value of 0 in EAX is gone! Now that we have baseline C code that will produce minimal assembly, let's get creative.
Operating systems and user-land programs need a way to interact. On most OSes, including UNIX and Linux, this is done through system calls, which transfer control to the kernel for a well-defined operation. Linux system calls are documented in section 2 of the man pages. We'll start our real journey into C-assisted assembly with the write system call.

write, unsurprisingly, writes data to a file descriptor. It's time for that old favorite, the Hello, World application!
#include <unistd.h>

void main(void)
{
    write(1, "Hello, World!\n", 14);
    /* 14 = strlen("Hello, World!\n") */
}
Compile that to a binary and make sure it works. Go ahead, I'll wait.
Working? Great. Let's examine the assembly.
.section .rodata
.LC0:
    .string "Hello, World!\n"
.text
.globl main
main:
    # standard frame pointer management
    pushq %rbp
    movq %rsp, %rbp

    movl $14, %edx   # 3rd arg to write -- number of bytes
    movl $.LC0, %esi # 2nd arg to write -- buffer to write
    movl $1, %edi    # 1st arg to write -- output file descriptor
    movl $0, %eax
    call write

    # standard frame pointer management
    popq %rbp
    ret
(I reformatted the assembly code a bit, introducing newlines to aid legibility)
This new program actually introduces two concepts: constant data and system calls. As such, we get a new section, called .rodata, which stores our statically initialized "Hello, World!\n" string. GCC adds a .LC0 label, which will be used later to refer to the memory location where our string sits.
The three movl instructions prime the general purpose registers with the arguments to our write call; EDI contains the file descriptor 1 (standard output), ESI is the memory address of our string, and EDX houses the third argument, the number of bytes that should be output (the constant 14).
Even though GCC put the register moves in reverse order, you don't have to. We are dealing with discrete destinations, and not a stack where order would matter.
The movl $0, %eax sets the EAX return register to 0, and call write issues the system call. The rest of the assembly is the boilerplate stuff we are used to seeing (frame pointer reset and return of control).
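Incidentally, if you want to convince yourself that write really is a thin shim over a kernel entry point, glibc's generic syscall(2) wrapper lets you make the same call by number (a quick sketch; SYS_write comes from sys/syscall.h):

#include <sys/syscall.h>
#include <unistd.h>

void main(void)
{
    /* same effect as write(1, "Hello, World!\n", 14) */
    syscall(SYS_write, 1, "Hello, World!\n", 14);
}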
What heresy is this? Avoid the C library? But it contains so much useful functionality that we don't want to have to write in assembly!
If you are actually trying to write 100% assembly programs, by all means, use the libc. However, for the purpose of learning assembly from idiomatic C, you're better off not using handy functions like printf and instead using system calls like write.
Why?
The generated code for system calls is much easier to understand, and since we are only interested in the mechanics of assembly, that's a huge win. To illustrate, try compiling this small snippet to assembly:
#include <stdio.h>

void main(void)
{
    printf("Hello, World!\n");
}
You'll probably get something like this (cruft removed):
.section .rodata
.LC0:
    .string "Hello, World!"
.text
.globl main
main:
    pushq %rbp
    movq %rsp, %rbp
    movl $.LC0, %edi
    call puts
    popq %rbp
    ret
Looks easy enough. In fact, it's less code than the write system call example. But what if we print more than just a single literal string?
#include <stdio.h>

void main(void)
{
    printf("%s, %d, %d, %d, %d, %d\n",
           "1-5 = ", 1, 2, 3, 4, 5);
}
And here's the assembly:
.section .rodata
.LC0:
    .string "%s, %d, %d, %d, %d, %d\n"
.LC1:
    .string "1-5 = "
.text
.globl main
main:
    pushq %rbp
    movq %rsp, %rbp
    subq $16, %rsp
    movl $.LC0, %eax
    movl $5, (%rsp)
    movl $4, %r9d
    movl $3, %r8d
    movl $2, %ecx
    movl $1, %edx
    movl $.LC1, %esi
    movq %rax, %rdi
    movl $0, %eax
    call printf
    leave
    ret
First of all, we ran out of general purpose registers and had to move to the stack (thus the movl $5, (%rsp) instruction). Secondly, this version calls printf, but the last version called puts. What gives?
The C library is a lot like GCC: they both exist to produce fast machine code, not predictable, clear machine code. GCC treats printf as a built-in, and will happily rewrite a call to it as a call to a cheaper function (like puts) whenever it can prove the output is the same.
I'm not saying that you should never use standard library functions. There is a time and a place for playing with libc to see how the assembly under the hood works. That being said, I strongly recommend that you save that type of exploration for when you are proficient enough in reading (and writing) assembly.
ABI stands for Application Binary Interface, and just like an API (Application Programming Interface) it specifies the exact mechanics of interfacing with another component, but at the machine code level.
The C Standard Library, for example, defines an API for dealing with system calls like creat. If you look up the man page for creat (man 2 creat), the calling convention is documented right at the top:
NAME
       open, creat - open and possibly create a file or device

SYNOPSIS
       #include <sys/types.h>
       #include <sys/stat.h>
       #include <fcntl.h>

       int open(const char *pathname, int flags);
       int open(const char *pathname, int flags, mode_t mode);
       int creat(const char *pathname, mode_t mode);
If you want to use the creat call from C, you call it with the path name of the file (as a null-terminated character string) and a mode parameter. It will return to you an integer, the semantics of which are defined further on in the manual.
Similarly, bits of machine code (including your program and the operating system) need to agree on how they will communicate using machine registers, memory and the stack. This is the ABI.
Because different machine architectures have different memory models, hardware registers and bus architectures, they also have different ABIs. This is why a binary executable compiled for Linux on the Intel 32-bit x86 platform won't run natively on the Intel 64-bit x86_64 platform (absent a compatibility layer), even though the two share much of an instruction set.
Normally, application programmers don't care about the ABI; it's something the compiler graciously takes care of. But in the world of assembly, you are the compiler, so you have to know these things.
The examples in this article (unless otherwise marked) are written for the Intel x86_64-linux ABI. There are notable differences between 32-bit and 64-bit Linux from the ABI standpoint, the most prominent of which is register usage.
On x86_32-linux, all arguments to functions are passed on the stack, without exception, in order. On x86_64-linux, the stack is only used for parameters that are too big to pass in registers, and for parameters beyond the sixth. Registers for x86_64 calls are, in order, %rdi, %rsi, %rdx, %rcx, %r8, and %r9. If you are looking at assembly that does lots of pushl instructions (but never a pushq) before calling out to another function, it is most likely x86_32.
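You can watch that sixth-register cutoff for yourself. Compile this made-up example with gcc -S; the first six arguments land in registers, and only the seventh touches the stack:

/* six args ride in registers; the seventh spills to the stack */
long sum7(long a, long b, long c, long d, long e, long f, long g)
{
    return a + b + c + d + e + f + g;
}

int main(void)
{
    /* in the generated assembly, 1 through 6 land in %rdi through %r9,
       and 7 is placed on the stack before the call */
    return (int)sum7(1, 2, 3, 4, 5, 6, 7);
}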
Now you should have the context and tools to assist you in learning assembly by writing C. I will be publishing more articles on understanding calling conventions, structures, loops and more, so stay tuned!
I started learning assembly a few weeks ago. Assembly?? Yeah. Assembly.
I've been slinging code for at least a decade. I've written C, Ruby, Basic, Perl, Lisp, Scheme, Haskell, Node/Javascript, Go, Lua and Erlang. I would consider myself proficient in half of those, competent in the other half. I believe strongly in higher-level languages. If you can express what you want to do in as few constructs as possible, why use a less powerful language?
But I always find myself drawn to assembly.
I think the reason lies in the fact that machine code is inescapable. Everything ends up as machine code, because it's the one and only language the machine speaks. C is arguably a thin wrapper around assembly, abstracting memory architecture and calling conventions. Interpreted languages like Javascript run through a virtual machine that is itself written in, ultimately, machine code. Though the Perl programmer doesn't sweat memory management, the machine does.
After reading Paul Graham's Beating the Averages, I took up his unwritten challenge to learn and use Lisp. Lisp is a beautifully simple language, and its elegance affords the programmer ultimate flexibility, with a little work. Compared to Lisp, assembly is a constrained, weak language, confined to the few operations the machine understands. Put a bunch of bits here. Put some more bits over there. Check this bit, set that bit. Bits, bits, bits, bits. There are no lists, strings, or objects. Math is confined to the arbitrary size limits of the processor.
And yet a Lisp interpreter is ultimately expressed in terms of these primitives. Assembly is the base. Everything else is just an abstraction.
I didn't set out to learn assembly for the usual reason of performance. What caused me to take the plunge and dive into assembly was language design.
For a personal project, I began writing a domain-specific embedded scripting language. Let's call it Q. Under the hood, Q expressions are implemented as binary parse trees, like most other languages. The parser builds up these trees in memory, and the run-time walks the tree and executes actions for each op-code.
Q functions are expressed as self-contained parse trees that are evaluated for every function call. Named functions are just parse trees linked to one or more symbol tables. Scope is implemented by keeping a stack of symbol tables, one per scope.
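To make that concrete, here's roughly the shape of those structures in C. The names are invented purely for illustration (Q's actual internals may differ):

struct qnode {
    int op;                     /* op-code the run-time executes */
    struct qnode *left, *right; /* binary parse tree */
};

struct qsym {
    const char *name;
    struct qnode *tree;         /* a named function is just a parse tree */
    struct qsym *next;
};

struct qscope {
    struct qsym *symbols;       /* one symbol table per scope... */
    struct qscope *up;          /* ...kept as a stack for lookups */
};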
What started out as a niche language with specific requirements morphed into a general purpose language. I blame Lisp for this. With a little bit of work and a couple dozen more parentheses here and there, Q could become a Lisp dialect.
Yawn. Boring. How many Lisp-y languages are out there? Ten? Twenty? More? Don't get me wrong, Lisp is awesome. But like other high-level languages (I'm looking at you, Perl/Ruby/Python), it requires an interpreter. Someone wrote some kick-ass program in Erlang? Gotta install the interpreter and standard libraries. Python? Get the interpreter. Ruby? You guessed it, you need an interpreter.
Wouldn't it be awesome to create a language with the elegance of Lisp and the speed of raw machine code? (Yes, yes. I am aware that Go has a unique ability to compile down to machine code.)
Enter assembly.
I figure if I know assembly well enough, translating Q parse trees and symbol tables into machine code should be fairly trivial. Patterns tend to bubble up from the most foundational levels of computer science. Q's design follows the Von Neumann architecture, which maps cleanly to real hardware architectures, and by extension, assembly.
Underlying all of this is my personal drive to learn. Learn anything. Learn everything. It's why I got involved in computers in the first place. And although my logical reasons for wanting to learn assembly are shaky at best (do we really need another Lisp? Probably not), the illogical reason is good enough for me.
Why do you pursue your passion?