I've been spending a lot of time working on the next iteration of the proof-of-concept Tweed data services broker. In the version I demonstrated back in December, all the heavy lifting was done by little bits of code and configuration called stencils. All the logic to deploy a Redis instance, for example, lived in a directory called redis/standalone.
This works great in theory -- whenever the broker has to do something to a Redis (standalone) instance, it can shell out and exec a small script which finds the stencil root and then executes other tools, like kubectl apply.
One of the biggest problems with this approach is all those pesky dependencies that exist outside of the stencil root: things like jq, and even bosh. What happens if one stencil requires a newer version of jq, and another requires an older version?
This type of dependency management has already been solved, and quite effectively too: OCI (Docker) images.
At its core, an OCI image is nothing more than a supporting root filesystem that contains all of the binaries, libraries, and base operational configuration files that a given process needs to properly execute.
If we re-design Tweed so that each stencil is an OCI image, we get a whole bunch of benefits:
- Tool packaging is entirely up to the person putting together the stencil. If they don't want Spruce, they don't have to install it. If they'd rather do everything in Go, that's their right.
- Instance versioning becomes easier. When Tweed provisions an instance, it can pin that instance to the unique identifier for the stencil OCI image that was used to deploy it. Then, future reconfiguration and re-provisioning uses that same image. This allows operators to upgrade the catalog without having to consider the impact of newer stencils on older instances. (This is a problem we struggle with regularly with Blacksmith).
- Building stencils relies on all the same tools we use for all of our other container-based efforts: things like
docker build, dive, and more are all available to us intrinsically.
- Testing is way easier. A docker run with the appropriate mounts, standard input, and environment variables set is all you need to exercise all the parts of a stencil. This can be formalized, tossed into a CI/CD solution (like GitHub Actions), and become a regular part of your test/build/release workflow.
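To make that last point concrete, here is one way such a test harness might look. The image name, mount point, and TWEED_* environment variables are all illustrative -- Tweed's real lifecycle contract may differ -- and the DRY_RUN knob lets you inspect the command without a Docker daemon on hand:

```shell
# Hypothetical harness for exercising a stencil image outside of Tweed.
test_stencil() {
  image=$1; action=$2

  # With DRY_RUN=echo, this prints the docker invocation instead of running it.
  ${DRY_RUN:-} docker run --rm -i \
    -v "$PWD/instance:/tweed/instance" \
    -e TWEED_ACTION="$action" \
    "$image" "/lifecycle/$action"
}

# Dry run; drop DRY_RUN to actually exercise the stencil:
DRY_RUN=echo test_stencil tweed/redis-standalone provision
```

Each lifecycle action becomes a single, repeatable command, which is exactly the shape a CI job wants.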
There's only one downside to using OCI images for parts of the application logic: carrying around a Docker daemon with you, everywhere Tweed goes.
The Tweed demo can be deployed to a Kubernetes cluster, today. Given a Service Account with appropriate cluster rights, it can then effortlessly turn around and start deploying new service instances to that cluster. We would like very much to retain that power / flexibility.
What we need is a way to run OCI images without a Docker daemon.
What we need is runc.

runc deals with Linux cgroups, and is at its core a Linux
containerization technology. As such, you'll only be able to use runc from
a proper Linux box. I haven't yet gotten around to testing this stuff on
Windows Subsystem for Linux (WSL). For best results, use something like
Vagrant if you don't already natively run on Linux.
The easiest way to install
runc is to grab the latest static binary from their GitHub Releases Page, then
chmod it to be executable (0755 works just fine) and pop it in a directory in your $PATH:

```
$ curl -sLo ~/bin/runc https://github.com/opencontainers/runc/releases/download/v1.0.0-rc9/runc.amd64
$ chmod 0755 ~/bin/runc
$ runc -v
runc version 1.0.0-rc9
spec: 1.0.1-dev
```
Taking runc for a Whirl
Let's start with a simple container:
alpine:3. With Docker, we can do things like this:
```
$ docker run alpine:3 ps -ef
PID   USER     TIME  COMMAND
    1 root      0:00 ps -ef
```
We want to figure out how to do that in
runc, and dispense with the orchestrating Docker daemon.
runc operates on OCI images; it doesn't talk to Docker registries. That means we can't docker pull, and we need to have the OCI image bits locally accessible to us for runc to be able to spin up the container.
This is where you'll start to notice differences between Docker-isms and the OCI standard proper: docker pull is a Docker-ism, and so is docker exec.
We could craft our own image bits (called a rootfs, in
runc parlance), but that's tedious and error-prone. Besides, we have a perfectly good image sitting inside of our Docker daemon; why not just extract the
alpine:3 image and use it?
```
$ docker run -d --name alpine-for-runc alpine:3
«some hard-to-remember container uuid»

$ mkdir rootfs
$ docker export alpine-for-runc | tar -x -C ./rootfs
$ ls -l ./rootfs
total 0
drwxr-xr-x  84 jhunt  staff  2688 Dec 24 10:04 bin
drwxr-xr-x   5 jhunt  staff   160 Feb 25 08:26 dev
drwxr-xr-x  36 jhunt  staff  1152 Feb 25 08:26 etc
... etc. ...
```
Docker doesn't let us extract image filesystems directly; it only allows us to export a container's filesystem. This turns out to be less of a blocker than it initially appears -- all we do is
docker run the image to get a container, and then export that. When we do so, we can pipe the export directly into tar; the -C flag tells tar to change into the new rootfs/ directory before it starts extracting. We'll give this directory to
runc when we start spooling up containers.
Now that we have the salient bits of the OCI image, it's time to start setting up execution parameters.
runc does this by way of a JSON configuration file, imaginatively named
config.json. You can write these by hand, or you can get one out of
runc spec and modify it to your liking.
We'll do the latter:
```
$ runc spec
$ ls -l config.json
$ sed -ie 's/"terminal": true/"terminal": false/' config.json
$ sed -ie 's/"sh"/"ps","-ef"/' config.json
```
Those last two
sed lines are a bit cryptic, but they boil down to a few simple, heuristic changes we need to make to the generated spec. By default,
runc spec creates a configuration that requires a terminal. We won't need an interactive pseudo-TTY device, so we can (and do) turn that off. We also want to swap out the default command sh for our ps -ef.
To run this, we'll
sudo as root and give it a whirl:
```
$ sudo runc run foo-$(date +%Y%m%d%H%M%S)
PID   USER     TIME  COMMAND
    1 root      0:00 ps -ef
```
Yay! It worked!
It's important to note here that the command to run inside of the container is specified in the config.json file, and nowhere else. We can't specify it on the runc command-line. This means that whenever we want to change commands, we need a different config.json. Tools like Spruce and jq can be extra handy here.
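One workable pattern is to keep a pristine template and stamp out a fresh spec per command with jq, rather than sed-ing a single file in place. A minimal sketch -- the template below is a tiny stand-in for the full output of runc spec, and the file names are my own:

```shell
# Stand-in for the (much larger) spec that `runc spec` generates:
cat > config.template.json <<'EOF'
{ "process": { "terminal": true, "args": [ "sh" ] } }
EOF

make_config() {
  # $1 = the new argv as a JSON array, e.g. '["ps","-ef"]'
  jq --argjson args "$1" \
     '.process.terminal = false | .process.args = $args' \
     config.template.json
}

make_config '["ps","-ef"]' > config.json
```

Every distinct command gets its own config.json, and the template never mutates.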
Now we've got OCI images executing as containers, without needing a Docker daemon. Go ahead and shut down your Docker daemon and try the above command.
The last thing we want to do is drop that
sudo requirement. Ideally, we'd like to be able to run our containers as ourselves (or as whatever account Tweed is going to execute as) without relying on privilege or specific Linux capabilities.
runc has a concept called Rootless Containers, whereby regular users like you and I are allowed to set up our own user namespaces. This relies on some fiddly in-kernel configuration bits, but most modern Linux distributions should have the CONFIG_USER_NS bit twiddled to the "yes" position. To find out (at least on my Ubuntu system):
```
$ grep CONFIG_USER_NS /boot/config-$(uname -r)
CONFIG_USER_NS=y
```
The difference between rootless containers and other (rootfull?) containers lies in their
config.json specs. To generate a rootless spec, use the --rootless flag:

```
$ runc spec --rootless
```
Note: you'll have to remove or rename your existing config.json spec; runc will (thankfully) refuse to overwrite it.
If you're curious, here's the semantic difference between the rootless spec and the one we started with (my UID/GID is 1000/1000):
```
$ spruce diff root-config.json config.json
linux
  - one map entry removed:
    resources:
      devices:
      - access: rwm
        allow: false
  + two map entries added:
    gidMappings:
    - containerID: 0
      size: 1
      hostID: 1000
    uidMappings:
    - containerID: 0
      size: 1
      hostID: 1000

linux.namespaces
  - one list entry removed:
    - type: network
  + one list entry added:
    - type: user

mounts
  - one list entry removed:
    - type: cgroup
      source: cgroup
      destination: /sys/fs/cgroup
      options:
      - nosuid
      - noexec
      - nodev
      - relatime
      - ro

mounts./dev/pts.options
  - one list entry removed:
    - gid=5

mounts./sys.options
  + one list entry added:
    - rbind

mounts./sys.source
  ± value change
    - sysfs
    + /sys

mounts./sys.type
  ± value change
    - sysfs
    + none
```
With this new spec in place (don't forget to run the same
sed commands we had earlier for dropping the terminal requirement and changing out the command to run), we can now run our Docker-less container as ourselves:
```
$ runc run foo
PID   USER     TIME  COMMAND
    1 root      0:00 ps -ef
```
Where to From Here?
There are several different ways you can go with this new technique.
For Tweed, I'll be investigating what it takes to modify the current lifecycle shell scripts such that they execute into a specially-crafted OCI image to do their work. This will most likely result in a small-ish command-line tool that shoulders some of the burden of configuring container specs, so that callers don't have to.
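As a rough illustration of the shape such a lifecycle hook might take: the bundle layout, naming scheme, and tweed- prefix below are all invented for this sketch, not Tweed's actual implementation. The RUNC knob lets you dry-run it without a runc binary installed:

```shell
# Hypothetical lifecycle hook that defers its real work to a stencil bundle.
run_lifecycle() {
  stencil=$1   # e.g. redis/standalone
  action=$2    # e.g. provision, bind, deprovision

  bundle="stencils/$stencil/$action"   # contains config.json and rootfs/
  id="tweed-$(echo "$stencil" | tr '/' '-')-$action-$$"

  # RUNC=echo prints the invocation instead of running it.
  ${RUNC:-runc} run --bundle "$bundle" "$id"
}

RUNC=echo run_lifecycle redis/standalone provision
```

Because the command baked into each bundle's config.json differs per action, one pre-built bundle per lifecycle action keeps the hook itself trivial.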
For other projects, both current and future, you can use this technique for deferring some part of the computation set before you to external sources. For example, an image management API might have pluggable backends, controllable by the operators, that are implemented in terms of conversion processes inside of OCI images. Want to support WebP? Just code up a new Docker container, export it, and drop it on the API implementation. You can even use registries like Docker Hub as a distribution mechanism -- push to Docker Hub, pull from Docker Hub on a re-packaging box, export, then upload!
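That push/pull/export/upload loop can be sketched as a small script. The image name and file names are hypothetical, and DOCKER=echo dry-runs the Docker steps without a daemon:

```shell
# Repackage an image from a registry into a runc-ready rootfs tarball.
repackage() {
  image=$1; out=$2
  d=${DOCKER:-docker}                 # DOCKER=echo for a daemon-less dry run

  $d pull "$image"
  $d create --name extract "$image"   # create (don't start) a container
  mkdir -p bundle/rootfs
  $d export --output=export.tar extract
  if [ -s export.tar ]; then
    tar -x -f export.tar -C bundle/rootfs
  fi
  $d rm extract

  tar -czf "$out" -C bundle rootfs    # ship this to the target box
}

DOCKER=echo repackage example/webp-converter:1.2.3 webp-converter.tgz
```

Note the use of docker create here instead of docker run: we never need the container to actually start, just to exist long enough to be exported.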
Having the ability to just execute OCI images at-will opens up a whole new world of cloud-native and container-native application design. I hope you find a use for it, and hit me up on Twitter if you do!