https://new.jameshunt.us

BOSH-lite/warden on 10.x.x.x

UPDATE: Geoff helpfully pointed out this morning that you can set the Warden netmask via the bare-metal.yml.example file. But that's no fun.


I recently spun up BOSH-lite on my lab server, thanks to the vagrant tweaks that Geoff Franks made against Ruben Koster's bare-metal-bosh-lite code.

With BOSH spinning, I put together a small testing BOSH release (which you can find here), and deployed it. The deployment worked fine, and soon I had some test VMs running as Warden containers on the server.

Problem is, the networking was all mismatched.

Anyone familiar with BOSH-lite is accustomed to the 10.244.0.0/16 network space that it uses by default. This convention-over-configuration approach has led to lots of BOSH releases shipping an out-of-the-box Warden configuration that assigns static IPs in the 10.244 network. No problem; BOSH-lite is intended to be used for development.

For reasons I'll not go into right now (maybe a future blog post), I want to be able to run several bare-metal BOSH servers using Warden containers. I don't want to invest the money in vSphere, and I don't have the patience to stand up OpenStack. That leaves AWS, which is too expensive for my ambitions, and a bit overkill.

Which is why I turned to BOSH + vagrant in the first place.

The problem I ran into had to do with routing. Here's a highly simplified view of the network topology inside of my lab:

[Figure: network diagram showing the decomposition of a 10.0.0.0/8 IPv4 network into a services network and an untrusted wireless clients network.]

All wireless clients (laptops and phones included) live on the Unprivileged Access Network (the pink cloud), which routes everything to the Core Network (green) via a consumer-grade wireless router at 10.0.1.1.

The core network router is a Linksys WRT54G with a custom build of OpenWRT that allows it to manage routing, firewall duty, and termination of a handful of OpenVPN point-to-point VPNs that stitch my network into a larger one distributed across the Internet.

The important part of the diagram, however, is the 10.4.0.0/16 (yellow) network labeled "Service Network 1". This network exists entirely inside of the beefy server hardware that I am running BOSH-lite on. The Core Network OpenWRT router has static routes for the 10.4/16 subnet.

The Easy Solution

The easy solution, of course, is to just change the Service Network range from 10.4/16 to 10.244/16. Problem solved. Off to the deployments!

Unfortunately, that masks a not-so-subtle problem with the default configuration of BOSH-lite (one that, for a development platform, is not important enough to warrant discussion): you can't run more than one, because every instance claims the same 10.244/16 range!

The Hacker Solution

Just for fun, let's try changing the BOSH deployment manifest to use networks in 10.4/16, and see if it works!

jumpbox $ ./test-dev manifest warden
jumpbox $ sed -i -e 's/10\.244\./10.4./g' manifests/test-warden-manifest.yml
jumpbox $ bosh -n deploy
Acting as user 'admin' on deployment 'test-dev' on 'Bosh Lite Director'
Getting deployment properties from director...

Deploying
---------

Director task 59
  Started unknown
  Started unknown > Binding deployment. Done (00:00:00)

  Started preparing deployment
  Started preparing deployment > Binding releases. Done (00:00:00)
  Started preparing deployment > Binding existing deployment. Done (00:00:00)
  Started preparing deployment > Binding resource pools. Done (00:00:00)
  Started preparing deployment > Binding stemcells. Done (00:00:00)
  Started preparing deployment > Binding templates. Done (00:00:00)
  Started preparing deployment > Binding properties. Done (00:00:00)
  Started preparing deployment > Binding unallocated VMs. Done (00:00:00)
  Started preparing deployment > Binding instance networks. Done (00:00:00)

  Started preparing package compilation > Finding packages to compile. Done (00:00:00)

  Started preparing dns > Binding DNS. Done (00:00:00)

  Started creating bound missing vms > small_z1/0. Failed: Creating VM with agent ID \
    'aacf0752-d306-4788-b6ee-aabbfc338d4b': Creating container: network already acquired: \
    10.4.2.8/30 (00:00:01)

Error 100: Creating VM with agent ID 'aacf0752-d306-4788-b6ee-aabbfc338d4b': \
  Creating container: network already acquired: 10.4.2.8/30

Task 59 error

For a more detailed error report, run: bosh task 59 --debug

You can try any number of different networks, but if they aren't in 10.244/16, BOSH (or more specifically, Warden) will fail to provision the network, claiming that it is "already acquired".
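That behavior is consistent with Warden only handing out subnets that fall inside its configured pool. Here is a minimal sketch of that kind of containment test in plain shell. It is not Warden's actual code, and the function names ip_to_int and cidr_contains are made up for this example; it only illustrates why 10.4.2.8/30 does not fit in 10.244.0.0/16 but does fit in 10.0.0.0/8.

```shell
# Illustrative only: these helpers are hypothetical, not part of BOSH or Warden.

# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() (
  IFS=.
  set -- $1    # split the address on dots into $1..$4
  echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
)

# Succeed if SUBNET (second argument) lies entirely inside POOL (first argument).
cidr_contains() (
  pool_addr=${1%/*}; pool_len=${1#*/}
  sub_addr=${2%/*};  sub_len=${2#*/}
  # A subnet with a shorter prefix than the pool cannot fit inside it.
  [ "$sub_len" -ge "$pool_len" ] || exit 1
  # Keep only the pool's network bits and compare them on both sides.
  mask=$(( (0xFFFFFFFF << (32 - pool_len)) & 0xFFFFFFFF ))
  [ $(( $(ip_to_int "$pool_addr") & mask )) -eq $(( $(ip_to_int "$sub_addr") & mask )) ]
)

cidr_contains 10.244.0.0/16 10.4.2.8/30 || echo "10.4.2.8/30 is outside 10.244.0.0/16"
cidr_contains 10.0.0.0/8    10.4.2.8/30 && echo "10.4.2.8/30 is inside 10.0.0.0/8"
```

Both echo lines fire: the /30 from the failed deploy sits outside the default pool, but inside the wider 10/8.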

Luckily, while the decision to use 10.244/16 is hard-coded into the BOSH-lite/warden distribution, it is explicitly called out in exactly one place: the startup script that runs the Warden supervisor.

boshbox # grep -nC6 10.244.0.0 /var/vcap/jobs/warden/bin/warden_ctl
29-    exec /var/vcap/packages/warden-linux/bin/warden-linux \
30-      -disableQuotas=true \
31-      -listenNetwork=tcp \
32-      -listenAddr=0.0.0.0:7777 \
33-      -denyNetworks= \
34-      -allowNetworks= \
35:      -networkPool=10.244.0.0/16 \
36-      -depot=/var/vcap/data/warden/depot \
37-      -rootfs=/var/vcap/packages/rootfs_lucid64 \
38-      -overlays=/var/vcap/data/warden/overlays \
39-      -bin=/var/vcap/packages/warden-linux/src/github.com/cloudfoundry-incubator/warden-linux/linux_backend/bin \
40-      -containerGraceTime=5m \
41-      1>>$LOG_DIR/warden.stdout.log \

If you change line 35 to specify the -networkPool option as 10.0.0.0/8, and subsequently restart the Warden supervisor, you can provision against any subnet of 10/8, and even use different nets on different boxen.

boshbox # sed -i -e 's@10.244.0.0/16@10.0.0.0/8@' \
          /var/vcap/jobs/warden/bin/warden_ctl
boshbox # monit restart warden
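For a slightly more cautious take on the same edit, the sketch below keeps a copy of the original script, replaces whatever pool is currently set (not just the default), and checks that the new value actually landed. The function name set_network_pool and the .orig backup suffix are made up for this example; the only things taken from above are the -networkPool flag and the path to warden_ctl. You would still need to restart Warden through monit afterwards.

```shell
# Illustrative only: set_network_pool is a hypothetical helper, not a BOSH tool.
# usage: set_network_pool FILE NEW_POOL
set_network_pool() (
  file=$1; pool=$2
  # Keep the untouched script next to the edited one (assumed suffix: .orig).
  cp -p "$file" "$file.orig" || exit 1
  # Write back into the existing file so it keeps its executable bit.
  sed -e "s@-networkPool=[0-9./]*@-networkPool=$pool@" "$file.orig" > "$file" || exit 1
  # Succeed only if the new pool is now present.
  grep -Fq -e "-networkPool=$pool" "$file"
)

# On the BOSH-lite box, as root:
#   set_network_pool /var/vcap/jobs/warden/bin/warden_ctl 10.0.0.0/8 \
#     && monit restart warden
```

If the new pool turns out to be a mistake, copying warden_ctl.orig back over warden_ctl and restarting Warden undoes the change.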

Happy Hacking!

James (@iamjameshunt) works on the Internet, spends his weekends developing new and interesting bits of software and his nights trying to make sense of research papers.

Currently exploring Kubernetes, as both a floor wax and a dessert topping.