I've been spending a lot of time working on the next iteration of the proof-of-concept Tweed data services broker. In the version I demonstrated back in December, all the heavy lifting was done by little bits of code + configuration called stencils. All the logic to deploy a standalone Redis instance, for example, lived in a directory called redis/standalone.
This works great in theory -- whenever the broker has to do something to a Redis (standalone) instance, it can shell out and exec a small script which finds the stencil root and then executes other things, like kubectl apply or bosh deploy.
One of the biggest problems with this approach is all those pesky dependencies that exist outside of the stencil root: things like spruce, jq, and even kubectl and bosh. What happens if one stencil requires a newer version of jq, and another requires an older version?
This type of dependency management has already been solved, and quite effectively too: OCI (Docker) images.
At its core, an OCI image is nothing more than a supporting root filesystem that contains all of the binaries, libraries, and base operational configuration files that a given process needs to properly execute.
If we re-design Tweed so that each stencil is an OCI image, we get a whole bunch of benefits:
- Tool packaging is entirely up to the person putting together the stencil. If they don't want Spruce, they don't have to install it. If they'd rather do everything in Go, that's their right.
- Instance versioning becomes easier. When Tweed provisions an instance, it can pin that instance to the unique identifier for the stencil OCI image that was used to deploy it. Then, future reconfiguration and re-provisioning uses that same image. This allows operators to upgrade the catalog without having to consider the impact of newer stencils on older instances. (This is a problem we struggle with regularly with Blacksmith).
- Building stencils relies on all the same tools we use for all of our other container-based efforts: things like docker build, dive, and more are all available to us intrinsically.
- Testing is way easier. A docker run with the appropriate mounts, standard input, and environment variables set is all you need to exercise all the parts of a stencil (see the sketch below). This can be formalized, tossed into a CI/CD solution (like GitHub Actions), and become a regular part of your test/build/release workflow.
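As a concrete (and entirely hypothetical) example, a smoke test for a Redis stencil might look something like this -- the image name, mount path, environment variable, entrypoint, and request file are all made up for illustration:

$ docker run --rm -i \
    -v $PWD/test/instance:/instance \
    -e INSTANCE_ID=test-1 \
    my-org/redis-standalone-stencil:latest \
    /lifecycle/provision < test/request.json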
There's only one downside to using OCI images for parts of the application logic: carrying around a Docker daemon with you, everywhere Tweed goes.
The Tweed demo can be deployed to a Kubernetes cluster, today. Given a Service Account with appropriate cluster rights, it can then effortlessly turn around and start deploying new service instances to that cluster. We would like very much to retain that power / flexibility.
What we need is a way to run OCI images without a Docker daemon.
What we need is runc.
Getting runc Installed
Note: runc deals with Linux cgroups and namespaces, and is at its core a Linux
containerization technology. As such, you'll only be able to use runc from
a proper Linux box. I haven't yet gotten around to testing this stuff on
Windows Subsystem for Linux (WSL). For best results, use something like
Vagrant if you don't already natively run on Linux.
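If you go the Vagrant route, something like this will get you a workable Linux box (the box name here is just one example):

$ vagrant init ubuntu/bionic64
$ vagrant up
$ vagrant ssh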
The easiest way to install runc is to grab the latest static binary from their GitHub Releases page, then chmod it to be executable (0755 works just fine) and pop it in a directory in your $PATH:
$ curl -sLo ~/bin/runc https://github.com/opencontainers/runc/releases/download/v1.0.0-rc9/runc.amd64
$ chmod 0755 ~/bin/runc
$ runc -v
runc version 1.0.0-rc9
spec: 1.0.1-dev
Taking runc for a Whirl
Let's start with a simple container: alpine:3. With Docker, we can do things like this:
$ docker run alpine:3 ps -ef
PID USER TIME COMMAND
1 root 0:00 ps -ef
We want to figure out how to do that in runc, and dispense with the orchestrating Docker daemon.
While runc operates on OCI images, it doesn't work with Docker registries. That means we can't docker pull, and we need to have the OCI image bits locally accessible to us for runc to be able to spin up the container.
This is where you'll start to notice differences between Docker-isms and
container-isms. docker pull is a Docker-ism. docker exec is a
container-ism.
We could craft our own image bits (called a rootfs, in runc parlance), but that's tedious and error-prone. Besides, we have a perfectly good image sitting inside of our Docker daemon; why not just extract the alpine:3 image and use it?
$ docker run -d --name alpine-for-runc alpine:3
«some hard-to-remember container uuid»
$ mkdir rootfs
$ docker export alpine-for-runc | tar -x -C ./rootfs
$ ls -l ./rootfs
total 0
drwxr-xr-x 84 jhunt staff 2688 Dec 24 10:04 bin
drwxr-xr-x 5 jhunt staff 160 Feb 25 08:26 dev
drwxr-xr-x 36 jhunt staff 1152 Feb 25 08:26 etc
... etc. ...
Docker doesn't let us extract image filesystems directly; it only allows us to export a container's filesystem. This turns out to be less of a blocker than it initially appears -- all we do is docker run the image to get a container, and then export that. When we do so, we can pipe the export directly into tar. The -C flag tells tar to change into the new rootfs/ directory before it starts extracting. We'll give this directory to runc when we start spooling up containers.
Now that we have the salient bits of the OCI image, it's time to start setting up execution parameters. runc does this by way of a JSON configuration file, imaginatively named config.json. You can write these by hand, or you can get one out of runc spec and modify it to your liking.
We'll do the latter:
$ runc spec
$ ls -l config.json
$ sed -ie 's/"terminal": true/"terminal": false/' config.json
$ sed -ie 's/"sh"/"ps","-ef"/' config.json
Those last two sed lines are a bit cryptic, but they boil down to a few simple changes we need to make to the generated spec. By default, runc spec creates a configuration that requires a terminal. We won't need an interactive pseudo-TTY device, so we can (and do) turn that off. We also want to swap out the default command sh for our ps -ef.
To run this, we'll sudo as root and give it a whirl:
$ sudo runc run foo-$(date +%Y%m%d%H%M%S)
PID USER TIME COMMAND
1 root 0:00 ps -ef
Yay! It worked!
It's important to note here that the command to run inside of the container is specified in the config.json file, and nowhere else. We don't specify it on the command-line for runc. This means that whenever we want to change commands, we need a different config.json. Tools like Spruce and jq can be extra handy here.
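For example, here's an (untested) jq equivalent of the sed edits above, producing a config.json tailored to the ps -ef case:

$ jq '.process.args = ["ps", "-ef"] | .process.terminal = false' config.json > config.json.new
$ mv config.json.new config.json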
Now we've got OCI images executing as containers, without needing a Docker daemon. Go ahead and shut down your Docker daemon and try the above command.
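On a systemd-based distribution, that might look something like this (your Docker packaging and service name may differ):

$ sudo systemctl stop docker
$ sudo runc run foo-$(date +%Y%m%d%H%M%S)
PID USER TIME COMMAND
1 root 0:00 ps -ef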
Rootless Containers
The last thing we want to do is drop that sudo requirement. Ideally, we'd like to be able to run our containers as ourselves (or as whatever account Tweed is going to execute as) without relying on privilege or specific Linux capabilities.
runc has a concept called Rootless Containers, whereby regular users like you and me can set up our own user namespaces. This relies on some fiddly in-kernel configuration bits, but most modern Linux distributions should have the CONFIG_USER_NS bit twiddled to the "yes" position. To find out (at least on my Ubuntu system):
$ grep CONFIG_USER_NS /boot/config-$(uname -r)
CONFIG_USER_NS=y
The difference between rootless containers and other (rootfull?) containers lies in their config.json specs. To generate a rootless spec, use the --rootless flag:
$ runc spec --rootless
Note: you'll have to remove or rename your existing config.json spec; runc will (thankfully) refuse to overwrite it.
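For example, to keep the original spec around for comparison (which is what I do below):

$ mv config.json root-config.json
$ runc spec --rootless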
If you're curious, here's the semantic difference between the rootless spec and the one we started with (my UID/GID is 1000/1000):
$ spruce diff root-config.json config.json
linux
- one map entry removed:        + two map entries added:
  resources:                      gidMappings:
  │ devices:                      - containerID: 0
  │ - access: rwm                 │ size: 1
  │ │ allow: false                │ hostID: 1000
                                  uidMappings:
                                  - containerID: 0
                                  │ size: 1
                                  │ hostID: 1000
linux.namespaces
- one list entry removed:       + one list entry added:
  - type: network                 - type: user
mounts
- one list entry removed:
- type: cgroup
│ source: cgroup
│ destination: /sys/fs/cgroup
│ options:
│ - nosuid
│ - noexec
│ - nodev
│ - relatime
│ - ro
mounts./dev/pts.options
- one list entry removed:
- gid=5
mounts./sys.options
+ one list entry added:
- rbind
mounts./sys.source
± value change
- sysfs
+ /sys
mounts./sys.type
± value change
- sysfs
+ none
With this new spec in place (don't forget to run the same sed commands we had earlier for dropping the terminal requirement and changing out the command to run), we can now run our Docker-less container as ourselves:
$ runc run foo
PID USER TIME COMMAND
1 root 0:00 ps -ef
Success!
Where to From Here?
There are several different ways you can go with this new technique.
For Tweed, I'll be investigating what it takes to modify the current lifecycle shell scripts so that they execute into a specially-crafted OCI image to do their work. This will most likely result in a small-ish command-line tool that takes on some of the burden of configuring container specs, so the caller doesn't have to.
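To make that concrete, here's a rough, hypothetical sketch of what such a lifecycle step might boil down to; the registry, image name, /lifecycle/provision entrypoint, and bundle layout are all invented for illustration:

# hypothetical sketch -- not Tweed's actual tooling
$ mkdir -p bundle/rootfs
$ docker export $(docker create registry.example.com/tweed/redis-standalone-stencil:1.2.3) | tar -x -C bundle/rootfs
$ cd bundle
$ runc spec --rootless
$ jq '.process.args = ["/lifecycle/provision"] | .process.terminal = false' config.json > config.json.new
$ mv config.json.new config.json
$ runc run stencil-$(date +%Y%m%d%H%M%S)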
For other projects, both current and future, you can use this technique for deferring some part of the computation set before you to external sources. For example, an image management API might have pluggable backends, controllable by the operators, that are implemented in terms of conversion processes inside of OCI images. Want to support WebP? Just code up a new Docker container, export it, and drop it on the API implementation. You can even use registries like Docker Hub as a distribution mechanism -- push to Docker Hub, pull from Docker Hub on a re-packaging box, export, then upload!
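That distribution loop might boil down to something like this on the re-packaging box (the image and host names are invented):

$ docker pull example/webp-converter:1.0
$ docker export $(docker create example/webp-converter:1.0) | gzip > webp-rootfs.tar.gz
$ scp webp-rootfs.tar.gz image-api-host:/var/lib/backends/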
Having the ability to just execute OCI images at-will opens up a whole new world of cloud-native and container-native application design. I hope you find a use for it, and hit me up on Twitter if you do!
Happy Hacking!